R has several systems for making graphs, but ggplot2 is one of the most elegant and versatile. It implements the grammar of graphics, a coherent system for describing and building graphs.
In order to use ggplot2, we first need to load the tidyverse package.
library(tidyverse)
## Loading tidyverse: ggplot2
## Loading tidyverse: tibble
## Loading tidyverse: tidyr
## Loading tidyverse: readr
## Loading tidyverse: purrr
## Loading tidyverse: dplyr
## Conflicts with tidy packages ----------------------------------------------
## filter(): dplyr, stats
## lag(): dplyr, stats
The mpg data frame is found in ggplot2 (ggplot2::mpg):
## # A tibble: 234 x 11
## manufacturer model displ year cyl trans drv cty hwy
## <chr> <chr> <dbl> <int> <int> <chr> <chr> <int> <int>
## 1 audi a4 1.8 1999 4 auto(l5) f 18 29
## 2 audi a4 1.8 1999 4 manual(m5) f 21 29
## 3 audi a4 2.0 2008 4 manual(m6) f 20 31
## 4 audi a4 2.0 2008 4 auto(av) f 21 30
## 5 audi a4 2.8 1999 6 auto(l5) f 16 26
## 6 audi a4 2.8 1999 6 manual(m5) f 18 26
## 7 audi a4 3.1 2008 6 auto(av) f 18 27
## 8 audi a4 quattro 1.8 1999 4 manual(m5) 4 18 26
## 9 audi a4 quattro 1.8 1999 4 auto(l5) 4 16 25
## 10 audi a4 quattro 2.0 2008 4 manual(m6) 4 20 28
## # ... with 224 more rows, and 2 more variables: fl <chr>, class <chr>
Among the variables in mpg are:
- displ: a car’s engine size, in litres.
- hwy: a car’s fuel efficiency on the highway, in miles per gallon (mpg). A car with low fuel efficiency consumes more fuel than a car with high fuel efficiency when they travel the same distance.

To plot hwy against displ:
ggplot(data = mpg) +
geom_point(mapping = aes(x = displ, y = hwy))
Template for making graphs:
ggplot(data = <DATA>) + <GEOM_FUNCTION>(mapping = aes(<MAPPINGS>))
Run ggplot(data = mpg). What do you see?
ggplot(data = mpg)
There’s an empty grey box.
How many rows are in mpg? How many columns? There are 234 rows and 11 columns (variables).
What does the drv variable describe? drv: f = front-wheel drive, r = rear-wheel drive, 4 = four-wheel drive.
Make a scatterplot of hwy vs cyl:
ggplot(data = mpg) +
geom_point(mapping = aes(x = hwy, y = cyl))
What happens if you make a scatterplot of class vs drv? Why is the plot not useful?
ggplot(data = mpg) +
geom_point(mapping = aes(x = class, y = drv))
It makes a scatterplot, but because both variables are categorical, many observations pile up at the same handful of positions. The plot only shows which drive trains occur for each class, not how many cars fall in each combination.
You can add a third variable to a two-dimensional scatterplot by mapping it to an aesthetic. An aesthetic is a visual property of the objects in your plot. These include the size, shape, and color of the points.
For example, we can map the color of the points to the class variable to reveal the class of each car.
ggplot(data = mpg) +
geom_point(mapping = aes(x = displ, y = hwy, color = class))
Mapping a discrete variable to size is not advised. We could use the alpha or shape aesthetics instead.
# Left
ggplot(data = mpg) +
geom_point(mapping = aes(x = displ, y = hwy, alpha = class))
# Right
ggplot(data = mpg) +
geom_point(mapping = aes(x = displ, y = hwy, shape = class))
## Warning: The shape palette can deal with a maximum of 6 discrete values
## because more than 6 becomes difficult to discriminate; you have 7.
## Consider specifying shapes manually if you must have them.
## Warning: Removed 62 rows containing missing values (geom_point).
ggplot2 will only map up to six discrete values to shapes at a time; additional groups go unplotted (hence the warning above).
ggplot(data = mpg) +
geom_point(mapping = aes(x = displ, y = hwy, color = "blue"))
The points are not blue because "blue" is not a variable within mpg, so it does not make sense to put it inside aes(). If we want the points to be blue, we need to set color = "blue" outside aes().
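As a sketch of the fix, move the colour outside aes() so it is set as a fixed visual property rather than mapped to a variable:

```r
library(ggplot2)

# color outside aes(): a fixed property applied to every point
ggplot(data = mpg) +
  geom_point(mapping = aes(x = displ, y = hwy), color = "blue")
```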
Which variables in mpg are categorical? Which variables are continuous? How can you see this information when you run mpg? To see it: run ?mpg, str(mpg), or print mpg and read the column types under each column name.
Map a continuous variable to color, size, and shape. How do these aesthetics behave differently for categorical vs. continuous variables?
ggplot(data = mpg) +
geom_point(mapping = aes(x = displ, y = hwy, color = year))
With a continuous variable such as year, color becomes a gradient; with a categorical variable, each level gets its own discrete color. Shape cannot be mapped to a continuous variable at all.
What happens if you map the same variable to multiple aesthetics?
ggplot(data = mpg) +
geom_point(mapping = aes(x = hwy, y = hwy, color = hwy))
By doing so, we get a positively sloped, 45-degree line, with the points colored along a gradient of hwy.
What does the stroke aesthetic do? What shapes does it work with? The stroke aesthetic controls the width of a point’s border; it works with shapes that have a border (shapes 21–24).
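For example, a sketch using one of the bordered shapes:

```r
library(ggplot2)

# shape 21 has both a fill and a border; stroke widens the border
ggplot(data = mpg) +
  geom_point(mapping = aes(x = displ, y = hwy),
             shape = 21, fill = "white", colour = "black", stroke = 2)
```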
What happens if you map an aesthetic to something other than a variable name, like aes(color = displ < 5)?
ggplot(data = mpg) +
geom_point(mapping = aes(x = displ, y = hwy, color = displ < 5))
R evaluates the expression to a Boolean (TRUE/FALSE), and the points are colored by whether they satisfy the condition.
Split plot into facets (subplots that each display one subset of the data).
To facet your plot by a single variable, use facet_wrap(). The first argument of facet_wrap() should be a formula, which you create with ~ followed by a variable name.
ggplot(data = mpg) +
geom_point(mapping = aes(x = displ, y = hwy)) +
facet_wrap(~ class, nrow = 2)
To facet your plot on the combination of two variables, add facet_grid().
ggplot(data = mpg) +
geom_point(mapping = aes(x = displ, y = hwy)) +
facet_grid(drv ~ cyl)
What happens if you facet on a continuous variable?
ggplot(data = mpg) +
geom_point(mapping = aes(x = displ, y = hwy)) +
facet_grid(. ~ cty)
Each unique value of cty gets its own facet, so the number of subplots is too large to be useful.
What do the empty cells in a plot with facet_grid(drv ~ cyl) mean? How do they relate to this plot?
ggplot(data = mpg) +
geom_point(mapping = aes(x = drv, y = cyl))
The empty cells mean that no cars in the data combine that drive train with that number of cylinders; in the scatterplot of drv vs cyl, the same combinations simply have no points.
What plots does the following code make? What does . do?
ggplot(data = mpg) +
geom_point(mapping = aes(x = displ, y = hwy)) +
facet_grid(drv ~ .)
ggplot(data = mpg) +
geom_point(mapping = aes(x = displ, y = hwy)) +
facet_grid(. ~ cyl)
The first plot facets by drive train, with one row per value of drv; the second facets by number of cylinders, with one column per value of cyl. The . stands in for “no variable” on that side of the formula.
Take the first faceted plot in this section. What are the advantages to using faceting instead of the colour aesthetic? What are the disadvantages? How might the balance change with a larger dataset?
ggplot(data = mpg) +
geom_point(mapping = aes(x = displ, y = hwy)) +
facet_wrap(~ class, nrow = 2)
Faceting shows each class’s mileage pattern in its own panel, making within-group detail easier to see; the disadvantage is that it is harder to compare classes across panels. With a larger dataset, overplotting would make a single coloured plot hard to read, so faceting becomes more attractive.
Read ?facet_wrap. What does nrow do? What does ncol do? What other options control the layout of the individual panels? Why doesn’t facet_grid() have nrow and ncol arguments?
nrow and ncol determine the number of rows and columns the facets are divided into. facet_grid() does not have these arguments because the number of rows and columns is determined by the number of levels of the two faceting variables.
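A sketch of the same faceted plot laid out with ncol instead of nrow:

```r
library(ggplot2)

# same facets as before, but arranged in 4 columns
ggplot(data = mpg) +
  geom_point(mapping = aes(x = displ, y = hwy)) +
  facet_wrap(~ class, ncol = 4)
```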
facet_grid() you should usually put the variable with more unique levels in the columns. Why?It makes the data more understandable.
ggplot(data = mpg) +
geom_smooth(mapping = aes(x = displ, y = hwy))
## `geom_smooth()` using method = 'loess'
Every geom function in ggplot2 takes a mapping argument. geom_smooth(), for example, can map the linetype aesthetic:
ggplot(data = mpg) +
geom_smooth(mapping = aes(x = displ, y = hwy, linetype = drv))
## `geom_smooth()` using method = 'loess'
To display multiple geoms in the same plot, add multiple geom functions to ggplot():
ggplot(data = mpg, mapping = aes(x = displ, y = hwy)) +
geom_point() +
geom_smooth()
## `geom_smooth()` using method = 'loess'
ggplot(data = mpg, mapping = aes(x = displ, y = hwy, color = drv)) +
geom_point() +
geom_smooth(se = FALSE)
## `geom_smooth()` using method = 'loess'
What does show.legend = FALSE do? What happens if you remove it? Why do you think I used it earlier in the chapter? show.legend = FALSE prevents the legend from being shown on the graph; if you remove it, the legend is shown.
What does the se argument to geom_smooth() do? It controls whether the confidence interval around the smooth is displayed (TRUE by default).
Will these two graphs look different? Why/why not?
ggplot(data = mpg, mapping = aes(x = displ, y = hwy)) +
geom_point() +
geom_smooth()
## `geom_smooth()` using method = 'loess'
ggplot() +
geom_point(data = mpg, mapping = aes(x = displ, y = hwy)) +
geom_smooth(data = mpg, mapping = aes(x = displ, y = hwy))
These two graphs look the same because they map the same data and aesthetics; the only difference is that in the second version the data and mapping are repeated inside geom_point() and geom_smooth() instead of being set once in ggplot().
Recreate the R code necessary to generate the following graphs.
# 1
ggplot(data = mpg, mapping = aes(x = displ, y = hwy)) +
geom_point() +
geom_smooth(se = FALSE)
## `geom_smooth()` using method = 'loess'
# 2
ggplot(data = mpg, mapping = aes(x = displ, y = hwy, group = drv)) +
geom_point(show.legend = FALSE) +
geom_smooth(se = FALSE, show.legend = FALSE)
## `geom_smooth()` using method = 'loess'
# 3
ggplot(data = mpg, mapping = aes(x = displ, y = hwy, color = drv)) +
geom_point() +
geom_smooth(se = FALSE)
## `geom_smooth()` using method = 'loess'
# 4
ggplot(data = mpg, mapping = aes(x = displ, y = hwy)) +
geom_point(mapping = aes(color = drv)) +
geom_smooth(se = FALSE)
## `geom_smooth()` using method = 'loess'
# 5
ggplot(data = mpg, mapping = aes(x = displ, y = hwy)) +
geom_point(mapping = aes(color = drv)) +
geom_smooth(mapping = aes(linetype = drv), se = FALSE)
## `geom_smooth()` using method = 'loess'
# 6
ggplot(data = mpg, mapping = aes(x = displ, y = hwy)) +
geom_point(size = 4, colour = "white") +
geom_point(aes(colour = drv))
ggplot(data = diamonds) +
geom_bar(mapping = aes(x = cut))
Every geom has a default stat, and every stat has a default geom. There are three reasons why you might need to use a stat explicitly:
1. You might want to override the default stat, e.g. use stat = "identity" to map the height of bars to a variable in the data.
2. You might want to override the default mapping from transformed variables to aesthetics, e.g. display a bar chart of proportions rather than counts:
ggplot(data = diamonds) +
geom_bar(mapping = aes(x = cut, y = ..prop.., group = 1))
3. You might want to draw greater attention to the statistical transformation in the code.
What is the default geom associated with stat_summary()? How could you rewrite the previous plot to use that geom function instead of the stat function? The default geom is geom_pointrange().
ggplot(data = diamonds) +
geom_pointrange(mapping = aes(x = cut, y = depth),
stat = "summary",
fun.ymin = min,
fun.ymax = max,
fun.y = median)
What does geom_col() do? How is it different to geom_bar()? geom_bar() makes the height of the bar proportional to the number of cases in each group (or, if the weight aesthetic is supplied, the sum of the weights). If you want the heights of the bars to represent values already in the data, use geom_col() instead.
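A minimal sketch of geom_col(), using a small hypothetical data frame (df, category, and value are made up for illustration):

```r
library(ggplot2)

# hypothetical pre-summarised data: bar heights come straight from value
df <- data.frame(category = c("a", "b", "c"), value = c(3, 5, 2))
ggplot(data = df) +
  geom_col(mapping = aes(x = category, y = value))
```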
Most geoms and stats come in pairs that are almost always used in concert. Read through the documentation and make a list of all the pairs. What do they have in common?
What variables does stat_smooth() compute? What parameters control its behaviour?
stat_smooth() calculates:
- y: predicted value
- ymin: lower pointwise confidence interval around the mean
- ymax: upper pointwise confidence interval around the mean
- se: standard error
Its behaviour is controlled by parameters such as method, formula, se, span, and level.
In our proportion bar chart, we need to set group = 1. Why? In other words what is the problem with these two graphs?
ggplot(data = diamonds) +
geom_bar(mapping = aes(x = cut, y = ..prop..))
ggplot(data = diamonds) +
geom_bar(mapping = aes(x = cut, fill = color, y = ..prop..))
If we fail to set group = 1, the proportions for each cut are calculated using the complete dataset, rather than each subset of cut. Instead, we want the graphs to look like this:
ggplot(data = diamonds) +
geom_bar(mapping = aes(x = cut, y = ..prop.., group = 1))
There’s one more piece of magic associated with bar charts. You can colour a bar chart using the fill aesthetic.
In geom_bar() the stacking is performed automatically by the position adjustment specified by the position argument. Can use three other options:
position = "identity" will place each object exactly where it falls in the context of the graph (not very useful for bars, because it overlaps them).
position = "fill" works like stacking, but makes each set of stacked bars the same height.
position = "dodge" places overlapping objects directly beside one another. Makes it clear to compare individual values.
ggplot(data = diamonds) +
geom_bar(mapping = aes(x = cut, fill = clarity), position = "dodge")
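For comparison, a sketch of position = "fill", which rescales each bar to the same height so the chart shows proportions:

```r
library(ggplot2)

# stacked bars rescaled to a common height: proportions within each cut
ggplot(data = diamonds) +
  geom_bar(mapping = aes(x = cut, fill = clarity), position = "fill")
```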
position = "jitter" is useful for scatter plots.
ggplot(data = mpg) +
geom_point(mapping = aes(x = displ, y = hwy), position = "jitter")
ggplot(data = mpg, mapping = aes(x = cty, y = hwy)) +
geom_point()
Many of the data points overlap. We can jitter the points by adding some slight random noise, which will improve the overall visualization.
ggplot(data = mpg, mapping = aes(x = cty, y = hwy)) +
geom_jitter()
How do the width and height arguments to geom_jitter() control the amount of jittering? width and height set the maximum amount of horizontal and vertical displacement, respectively.
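For example, a sketch with modest jitter in both directions:

```r
library(ggplot2)

# displace each point by at most 0.2 units horizontally and vertically
ggplot(data = mpg, mapping = aes(x = cty, y = hwy)) +
  geom_jitter(width = 0.2, height = 0.2)
```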
Compare and contrast geom_jitter() with geom_count(). geom_count() counts the observations at each location and maps the count to point size, while geom_jitter() adds a small amount of random noise to each point.
What’s the default position adjustment for geom_boxplot()? Create a visualisation of the mpg dataset that demonstrates it. position = "dodge" is the default position for geom_boxplot().
ggplot(data = mpg) +
geom_boxplot(mapping = aes(x = class, y = hwy, color = drv))
coord_flip() switches the \(x\) and \(y\) axes. Useful if you want horizontal boxplots and for long labels.
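A minimal sketch of horizontal boxplots via coord_flip():

```r
library(ggplot2)

# build the boxplots vertically, then flip the axes
ggplot(data = mpg, mapping = aes(x = class, y = hwy)) +
  geom_boxplot() +
  coord_flip()
```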
coord_quickmap() sets the aspect ratio correctly for maps. Very important if you’re plotting spatial data with ggplot2.
nz <- map_data("nz")
##
## Attaching package: 'maps'
## The following object is masked from 'package:purrr':
##
## map
ggplot(nz, aes(long, lat, group = group)) +
geom_polygon(fill = "white", colour = "black")
ggplot(nz, aes(long, lat, group = group)) +
geom_polygon(fill = "white", colour = "black") +
coord_quickmap()
coord_polar() uses polar coordinates. They reveal an interesting connection between a bar chart and a Coxcomb chart.
bar <- ggplot(data = diamonds) +
geom_bar(
mapping = aes(x = cut, fill = cut),
show.legend = FALSE,
width = 1
) +
theme(aspect.ratio = 1) +
labs(x = NULL, y = NULL)
bar + coord_flip()
bar + coord_polar()
To turn a stacked bar chart into a pie chart, use coord_polar():
ex1 <- ggplot(data = diamonds, mapping = aes(x = cut, fill = clarity)) +
geom_bar()
ex1 + coord_polar()
What does labs() do? It adds labels to the graph: a title, subtitle, and caption, plus labels for the \(x\) and \(y\) axes and legends, ensuring they can display full variable names.
What’s the difference between coord_quickmap() and coord_map()? coord_map() projects a portion of the earth, which is approximately spherical, onto a flat 2D plane. coord_quickmap() is a quick approximation of that projection which does preserve straight lines; it works best for smaller areas closer to the equator.
Why is coord_fixed() important? What does geom_abline() do? With coord_fixed() the plot draws equal intervals on the \(x\) and \(y\) axes so they are directly comparable. geom_abline() draws a line that, by default, has an intercept of 0 and slope of 1. Together they aid our discovery that automobile fuel efficiency is on average slightly higher on the highway than in the city, though the slope of the relationship is still roughly 1-to-1.

You can create new objects with <-:
x <- 3 * 4
All R statements where you create objects, assignment statements, have the same form:
object_name <- value
It is recommended to use snake_case, where you separate lowercase words with _.
This is how object/function names should look: i_use_snake_case
R has a large collection of built-in functions that are called like this: function_name(arg1 = val1, arg2 = val2, ...)
To create an object and print it in one step, surround the assignment with parentheses:
(y <- seq(1,10, length.out = 5))
## [1] 1.00 3.25 5.50 7.75 10.00
my_variable <- 10
#my_varıable
It does not work because the name is misspelled the second time: my_varıable uses a dotless ı in place of i.
library(tidyverse)
ggplot(data = mpg) +
geom_point(mapping = aes(x = displ, y = hwy))
filter(mpg, cyl == 8)
## # A tibble: 70 x 11
## manufacturer model displ year cyl trans drv
## <chr> <chr> <dbl> <int> <int> <chr> <chr>
## 1 audi a6 quattro 4.2 2008 8 auto(s6) 4
## 2 chevrolet c1500 suburban 2wd 5.3 2008 8 auto(l4) r
## 3 chevrolet c1500 suburban 2wd 5.3 2008 8 auto(l4) r
## 4 chevrolet c1500 suburban 2wd 5.3 2008 8 auto(l4) r
## 5 chevrolet c1500 suburban 2wd 5.7 1999 8 auto(l4) r
## 6 chevrolet c1500 suburban 2wd 6.0 2008 8 auto(l4) r
## 7 chevrolet corvette 5.7 1999 8 manual(m6) r
## 8 chevrolet corvette 5.7 1999 8 auto(l4) r
## 9 chevrolet corvette 6.2 2008 8 manual(m6) r
## 10 chevrolet corvette 6.2 2008 8 auto(s6) r
## # ... with 60 more rows, and 4 more variables: cty <int>, hwy <int>,
## # fl <chr>, class <chr>
filter(diamonds, carat > 3)
## # A tibble: 32 x 10
## carat cut color clarity depth table price x y z
## <dbl> <ord> <ord> <ord> <dbl> <dbl> <int> <dbl> <dbl> <dbl>
## 1 3.01 Premium I I1 62.7 58 8040 9.10 8.97 5.67
## 2 3.11 Fair J I1 65.9 57 9823 9.15 9.02 5.98
## 3 3.01 Premium F I1 62.2 56 9925 9.24 9.13 5.73
## 4 3.05 Premium E I1 60.9 58 10453 9.26 9.25 5.66
## 5 3.02 Fair I I1 65.2 56 10577 9.11 9.02 5.91
## 6 3.01 Fair H I1 56.1 62 10761 9.54 9.38 5.31
## 7 3.65 Fair H I1 67.1 53 11668 9.53 9.48 6.38
## 8 3.24 Premium H I1 62.1 58 12300 9.44 9.40 5.85
## 9 3.22 Ideal I I1 62.6 55 12545 9.49 9.42 5.92
## 10 3.50 Ideal H I1 62.8 57 12587 9.65 9.59 6.03
## # ... with 22 more rows
library(nycflights13)
library(tidyverse)
**Tibbles**: data frames, but slightly tweaked to work better in the tidyverse.
Column types in a tibble:
- int: integers
- dbl: doubles, or real numbers
- chr: character vectors, or strings
- dttm: date-times (a date + a time)
- lgl: logical, TRUE or FALSE
- fctr: factors, which R uses to represent categorical variables with fixed possible values
- date: dates

The key dplyr verbs:
- filter(): pick observations by their values
- arrange(): reorder the rows
- select(): pick variables by their names
- mutate(): create new variables with functions of existing variables
- summarise(): collapse many values down to a single summary

Filter flights for the flights that happened on January 1st:
jan1 <- filter(flights, month == 1, day == 1)
Flights on December 25 (Christmas)
(dec25 <- filter(flights, month == 12, day == 25))
## # A tibble: 719 x 19
## year month day dep_time sched_dep_time dep_delay arr_time
## <int> <int> <int> <int> <int> <dbl> <int>
## 1 2013 12 25 456 500 -4 649
## 2 2013 12 25 524 515 9 805
## 3 2013 12 25 542 540 2 832
## 4 2013 12 25 546 550 -4 1022
## 5 2013 12 25 556 600 -4 730
## 6 2013 12 25 557 600 -3 743
## 7 2013 12 25 557 600 -3 818
## 8 2013 12 25 559 600 -1 855
## 9 2013 12 25 559 600 -1 849
## 10 2013 12 25 600 600 0 850
## # ... with 709 more rows, and 12 more variables: sched_arr_time <int>,
## # arr_delay <dbl>, carrier <chr>, flight <int>, tailnum <chr>,
## # origin <chr>, dest <chr>, air_time <dbl>, distance <dbl>, hour <dbl>,
## # minute <dbl>, time_hour <dttm>
& is “and”, | is “or”, and ! is “not”.
The following code finds all the flights that departed in November or December:
filter(flights, month == 11 | month == 12)
## # A tibble: 55,403 x 19
## year month day dep_time sched_dep_time dep_delay arr_time
## <int> <int> <int> <int> <int> <dbl> <int>
## 1 2013 11 1 5 2359 6 352
## 2 2013 11 1 35 2250 105 123
## 3 2013 11 1 455 500 -5 641
## 4 2013 11 1 539 545 -6 856
## 5 2013 11 1 542 545 -3 831
## 6 2013 11 1 549 600 -11 912
## 7 2013 11 1 550 600 -10 705
## 8 2013 11 1 554 600 -6 659
## 9 2013 11 1 554 600 -6 826
## 10 2013 11 1 554 600 -6 749
## # ... with 55,393 more rows, and 12 more variables: sched_arr_time <int>,
## # arr_delay <dbl>, carrier <chr>, flight <int>, tailnum <chr>,
## # origin <chr>, dest <chr>, air_time <dbl>, distance <dbl>, hour <dbl>,
## # minute <dbl>, time_hour <dttm>
A useful shorthand is x %in% y, which selects every row where x is one of the values in y:
nov_dec <- filter(flights, month %in% c(11,12))
Can simplify complicated subsetting by remembering De Morgan’s law: !(x & y) is the same as !x | !y, and !(x | y) is the same as !x & !y.
To find flights that weren’t delayed (on arrival or departure) by more than two hours, you could use either of the following two filters:
filter(flights, !(arr_delay > 120 | dep_delay > 120))
filter(flights, arr_delay <= 120, dep_delay <= 120)
If you want to determine if a value is missing, use is.na():
x <- NA
is.na(x)
## [1] TRUE
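A small sketch of why is.na() matters inside filter(): filter() keeps only rows where the condition is TRUE, dropping both FALSE and NA results.

```r
library(dplyr)

df <- tibble(x = c(1, NA, 3))
filter(df, x > 1)             # the NA row is dropped along with x == 1
filter(df, is.na(x) | x > 1)  # asking for NAs explicitly keeps that row
```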
library(nycflights13)
Find all flights that arrived two or more hours late:
filter(flights, arr_delay >= 120)
## # A tibble: 10,200 x 19
## year month day dep_time sched_dep_time dep_delay arr_time
## <int> <int> <int> <int> <int> <dbl> <int>
## 1 2013 1 1 811 630 101 1047
## 2 2013 1 1 848 1835 853 1001
## 3 2013 1 1 957 733 144 1056
## 4 2013 1 1 1114 900 134 1447
## 5 2013 1 1 1505 1310 115 1638
## 6 2013 1 1 1525 1340 105 1831
## 7 2013 1 1 1549 1445 64 1912
## 8 2013 1 1 1558 1359 119 1718
## 9 2013 1 1 1732 1630 62 2028
## 10 2013 1 1 1803 1620 103 2008
## # ... with 10,190 more rows, and 12 more variables: sched_arr_time <int>,
## # arr_delay <dbl>, carrier <chr>, flight <int>, tailnum <chr>,
## # origin <chr>, dest <chr>, air_time <dbl>, distance <dbl>, hour <dbl>,
## # minute <dbl>, time_hour <dttm>
Flew to Houston (IAH or HOU):
filter(flights, dest %in% c('IAH','HOU'))
## # A tibble: 9,313 x 19
## year month day dep_time sched_dep_time dep_delay arr_time
## <int> <int> <int> <int> <int> <dbl> <int>
## 1 2013 1 1 517 515 2 830
## 2 2013 1 1 533 529 4 850
## 3 2013 1 1 623 627 -4 933
## 4 2013 1 1 728 732 -4 1041
## 5 2013 1 1 739 739 0 1104
## 6 2013 1 1 908 908 0 1228
## 7 2013 1 1 1028 1026 2 1350
## 8 2013 1 1 1044 1045 -1 1352
## 9 2013 1 1 1114 900 134 1447
## 10 2013 1 1 1205 1200 5 1503
## # ... with 9,303 more rows, and 12 more variables: sched_arr_time <int>,
## # arr_delay <dbl>, carrier <chr>, flight <int>, tailnum <chr>,
## # origin <chr>, dest <chr>, air_time <dbl>, distance <dbl>, hour <dbl>,
## # minute <dbl>, time_hour <dttm>
filter(flights, dest == "IAH" | dest == "HOU")
## # A tibble: 9,313 x 19
## year month day dep_time sched_dep_time dep_delay arr_time
## <int> <int> <int> <int> <int> <dbl> <int>
## 1 2013 1 1 517 515 2 830
## 2 2013 1 1 533 529 4 850
## 3 2013 1 1 623 627 -4 933
## 4 2013 1 1 728 732 -4 1041
## 5 2013 1 1 739 739 0 1104
## 6 2013 1 1 908 908 0 1228
## 7 2013 1 1 1028 1026 2 1350
## 8 2013 1 1 1044 1045 -1 1352
## 9 2013 1 1 1114 900 134 1447
## 10 2013 1 1 1205 1200 5 1503
## # ... with 9,303 more rows, and 12 more variables: sched_arr_time <int>,
## # arr_delay <dbl>, carrier <chr>, flight <int>, tailnum <chr>,
## # origin <chr>, dest <chr>, air_time <dbl>, distance <dbl>, hour <dbl>,
## # minute <dbl>, time_hour <dttm>
Were operated by United, American, or Delta:
filter(flights, carrier %in% c("UA","AA","DL"))
## # A tibble: 139,504 x 19
## year month day dep_time sched_dep_time dep_delay arr_time
## <int> <int> <int> <int> <int> <dbl> <int>
## 1 2013 1 1 517 515 2 830
## 2 2013 1 1 533 529 4 850
## 3 2013 1 1 542 540 2 923
## 4 2013 1 1 554 600 -6 812
## 5 2013 1 1 554 558 -4 740
## 6 2013 1 1 558 600 -2 753
## 7 2013 1 1 558 600 -2 924
## 8 2013 1 1 558 600 -2 923
## 9 2013 1 1 559 600 -1 941
## 10 2013 1 1 559 600 -1 854
## # ... with 139,494 more rows, and 12 more variables: sched_arr_time <int>,
## # arr_delay <dbl>, carrier <chr>, flight <int>, tailnum <chr>,
## # origin <chr>, dest <chr>, air_time <dbl>, distance <dbl>, hour <dbl>,
## # minute <dbl>, time_hour <dttm>
Departed in summer (June, July, and August):
filter(flights, month %in% c(6,7,8))
## # A tibble: 86,995 x 19
## year month day dep_time sched_dep_time dep_delay arr_time
## <int> <int> <int> <int> <int> <dbl> <int>
## 1 2013 6 1 2 2359 3 341
## 2 2013 6 1 451 500 -9 624
## 3 2013 6 1 506 515 -9 715
## 4 2013 6 1 534 545 -11 800
## 5 2013 6 1 538 545 -7 925
## 6 2013 6 1 539 540 -1 832
## 7 2013 6 1 546 600 -14 850
## 8 2013 6 1 551 600 -9 828
## 9 2013 6 1 552 600 -8 647
## 10 2013 6 1 553 600 -7 700
## # ... with 86,985 more rows, and 12 more variables: sched_arr_time <int>,
## # arr_delay <dbl>, carrier <chr>, flight <int>, tailnum <chr>,
## # origin <chr>, dest <chr>, air_time <dbl>, distance <dbl>, hour <dbl>,
## # minute <dbl>, time_hour <dttm>
Arrived more than two hours late, but didn’t leave late:
filter(flights, arr_delay >= 120 & dep_delay <= 0)
## # A tibble: 29 x 19
## year month day dep_time sched_dep_time dep_delay arr_time
## <int> <int> <int> <int> <int> <dbl> <int>
## 1 2013 1 27 1419 1420 -1 1754
## 2 2013 10 7 1350 1350 0 1736
## 3 2013 10 7 1357 1359 -2 1858
## 4 2013 10 16 657 700 -3 1258
## 5 2013 11 1 658 700 -2 1329
## 6 2013 3 18 1844 1847 -3 39
## 7 2013 4 17 1635 1640 -5 2049
## 8 2013 4 18 558 600 -2 1149
## 9 2013 4 18 655 700 -5 1213
## 10 2013 5 22 1827 1830 -3 2217
## # ... with 19 more rows, and 12 more variables: sched_arr_time <int>,
## # arr_delay <dbl>, carrier <chr>, flight <int>, tailnum <chr>,
## # origin <chr>, dest <chr>, air_time <dbl>, distance <dbl>, hour <dbl>,
## # minute <dbl>, time_hour <dttm>
filter(flights, dep_delay >= 60 & (dep_delay - arr_delay) <= 60)
## # A tibble: 26,772 x 19
## year month day dep_time sched_dep_time dep_delay arr_time
## <int> <int> <int> <int> <int> <dbl> <int>
## 1 2013 1 1 811 630 101 1047
## 2 2013 1 1 826 715 71 1136
## 3 2013 1 1 848 1835 853 1001
## 4 2013 1 1 957 733 144 1056
## 5 2013 1 1 1114 900 134 1447
## 6 2013 1 1 1120 944 96 1331
## 7 2013 1 1 1301 1150 71 1518
## 8 2013 1 1 1337 1220 77 1649
## 9 2013 1 1 1400 1250 70 1645
## 10 2013 1 1 1505 1310 115 1638
## # ... with 26,762 more rows, and 12 more variables: sched_arr_time <int>,
## # arr_delay <dbl>, carrier <chr>, flight <int>, tailnum <chr>,
## # origin <chr>, dest <chr>, air_time <dbl>, distance <dbl>, hour <dbl>,
## # minute <dbl>, time_hour <dttm>
Departed between midnight and 6am (inclusive):
filter(flights, dep_time >= 0, dep_time <= 600)
## # A tibble: 9,344 x 19
## year month day dep_time sched_dep_time dep_delay arr_time
## <int> <int> <int> <int> <int> <dbl> <int>
## 1 2013 1 1 517 515 2 830
## 2 2013 1 1 533 529 4 850
## 3 2013 1 1 542 540 2 923
## 4 2013 1 1 544 545 -1 1004
## 5 2013 1 1 554 600 -6 812
## 6 2013 1 1 554 558 -4 740
## 7 2013 1 1 555 600 -5 913
## 8 2013 1 1 557 600 -3 709
## 9 2013 1 1 557 600 -3 838
## 10 2013 1 1 558 600 -2 753
## # ... with 9,334 more rows, and 12 more variables: sched_arr_time <int>,
## # arr_delay <dbl>, carrier <chr>, flight <int>, tailnum <chr>,
## # origin <chr>, dest <chr>, air_time <dbl>, distance <dbl>, hour <dbl>,
## # minute <dbl>, time_hour <dttm>
Another useful dplyr filtering helper is between(). What does it do? Can you use it to simplify the code needed to answer the previous challenges? It’s a shortcut for x >= left & x <= right, and it can be used to answer several of the previous questions more simply.
For example:
filter(flights, month >= 7, month <= 9)
## # A tibble: 86,326 x 19
## year month day dep_time sched_dep_time dep_delay arr_time
## <int> <int> <int> <int> <int> <dbl> <int>
## 1 2013 7 1 1 2029 212 236
## 2 2013 7 1 2 2359 3 344
## 3 2013 7 1 29 2245 104 151
## 4 2013 7 1 43 2130 193 322
## 5 2013 7 1 44 2150 174 300
## 6 2013 7 1 46 2051 235 304
## 7 2013 7 1 48 2001 287 308
## 8 2013 7 1 58 2155 183 335
## 9 2013 7 1 100 2146 194 327
## 10 2013 7 1 100 2245 135 337
## # ... with 86,316 more rows, and 12 more variables: sched_arr_time <int>,
## # arr_delay <dbl>, carrier <chr>, flight <int>, tailnum <chr>,
## # origin <chr>, dest <chr>, air_time <dbl>, distance <dbl>, hour <dbl>,
## # minute <dbl>, time_hour <dttm>
filter(flights, between(month, 7, 9))
## # A tibble: 86,326 x 19
## year month day dep_time sched_dep_time dep_delay arr_time
## <int> <int> <int> <int> <int> <dbl> <int>
## 1 2013 7 1 1 2029 212 236
## 2 2013 7 1 2 2359 3 344
## 3 2013 7 1 29 2245 104 151
## 4 2013 7 1 43 2130 193 322
## 5 2013 7 1 44 2150 174 300
## 6 2013 7 1 46 2051 235 304
## 7 2013 7 1 48 2001 287 308
## 8 2013 7 1 58 2155 183 335
## 9 2013 7 1 100 2146 194 327
## 10 2013 7 1 100 2245 135 337
## # ... with 86,316 more rows, and 12 more variables: sched_arr_time <int>,
## # arr_delay <dbl>, carrier <chr>, flight <int>, tailnum <chr>,
## # origin <chr>, dest <chr>, air_time <dbl>, distance <dbl>, hour <dbl>,
## # minute <dbl>, time_hour <dttm>
How many flights have a missing dep_time? What other variables are missing? What might these rows represent?
filter(flights, is.na(dep_time))
## # A tibble: 8,255 x 19
## year month day dep_time sched_dep_time dep_delay arr_time
## <int> <int> <int> <int> <int> <dbl> <int>
## 1 2013 1 1 NA 1630 NA NA
## 2 2013 1 1 NA 1935 NA NA
## 3 2013 1 1 NA 1500 NA NA
## 4 2013 1 1 NA 600 NA NA
## 5 2013 1 2 NA 1540 NA NA
## 6 2013 1 2 NA 1620 NA NA
## 7 2013 1 2 NA 1355 NA NA
## 8 2013 1 2 NA 1420 NA NA
## 9 2013 1 2 NA 1321 NA NA
## 10 2013 1 2 NA 1545 NA NA
## # ... with 8,245 more rows, and 12 more variables: sched_arr_time <int>,
## # arr_delay <dbl>, carrier <chr>, flight <int>, tailnum <chr>,
## # origin <chr>, dest <chr>, air_time <dbl>, distance <dbl>, hour <dbl>,
## # minute <dbl>, time_hour <dttm>
There are 8,255 flights missing dep_time. Arrival time and the departure/arrival delays are missing for the same rows; these flights were most likely cancelled.
Why is NA ^ 0 not missing? Why is NA | TRUE not missing? Why is FALSE & NA not missing? Can you figure out the general rule? (NA * 0 is a tricky counterexample.)
- NA ^ 0: by definition anything to the 0th power is 1, whatever value the NA stands for.
- NA | TRUE: as long as one operand is TRUE, the result is TRUE.
- FALSE & NA: FALSE & anything is FALSE, so the missing value cannot change the result.
The general rule: the result is not missing whenever it is determined regardless of what value the NA takes. NA * 0 is the tricky counterexample because Inf * 0 is NaN, so the result is not determined.
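These rules can be checked directly at the console:

```r
# non-missing whenever the result is determined regardless of the NA
NA ^ 0      # 1
NA | TRUE   # TRUE
FALSE & NA  # FALSE
NA * 0      # NA (Inf * 0 would be NaN, so the result is not determined)
```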
arrange() works similarly to filter() except that instead of selecting rows, it changes their order.
arrange(flights, year, month, day)
## # A tibble: 336,776 x 19
## year month day dep_time sched_dep_time dep_delay arr_time
## <int> <int> <int> <int> <int> <dbl> <int>
## 1 2013 1 1 517 515 2 830
## 2 2013 1 1 533 529 4 850
## 3 2013 1 1 542 540 2 923
## 4 2013 1 1 544 545 -1 1004
## 5 2013 1 1 554 600 -6 812
## 6 2013 1 1 554 558 -4 740
## 7 2013 1 1 555 600 -5 913
## 8 2013 1 1 557 600 -3 709
## 9 2013 1 1 557 600 -3 838
## 10 2013 1 1 558 600 -2 753
## # ... with 336,766 more rows, and 12 more variables: sched_arr_time <int>,
## # arr_delay <dbl>, carrier <chr>, flight <int>, tailnum <chr>,
## # origin <chr>, dest <chr>, air_time <dbl>, distance <dbl>, hour <dbl>,
## # minute <dbl>, time_hour <dttm>
Can use desc() to re-order by a column in descending order:
arrange(flights, desc(arr_delay))
## # A tibble: 336,776 x 19
## year month day dep_time sched_dep_time dep_delay arr_time
## <int> <int> <int> <int> <int> <dbl> <int>
## 1 2013 1 9 641 900 1301 1242
## 2 2013 6 15 1432 1935 1137 1607
## 3 2013 1 10 1121 1635 1126 1239
## 4 2013 9 20 1139 1845 1014 1457
## 5 2013 7 22 845 1600 1005 1044
## 6 2013 4 10 1100 1900 960 1342
## 7 2013 3 17 2321 810 911 135
## 8 2013 7 22 2257 759 898 121
## 9 2013 12 5 756 1700 896 1058
## 10 2013 5 3 1133 2055 878 1250
## # ... with 336,766 more rows, and 12 more variables: sched_arr_time <int>,
## # arr_delay <dbl>, carrier <chr>, flight <int>, tailnum <chr>,
## # origin <chr>, dest <chr>, air_time <dbl>, distance <dbl>, hour <dbl>,
## # minute <dbl>, time_hour <dttm>
Missing values are always sorted at the end:
df <- tibble(x = c(5, 2, NA))
arrange(df, x)
## # A tibble: 3 x 1
## x
## <dbl>
## 1 2
## 2 5
## 3 NA
arrange(df, desc(x))
## # A tibble: 3 x 1
## x
## <dbl>
## 1 5
## 2 2
## 3 NA
How could you use arrange() to sort all missing values to the start? (Hint: use is.na().) Arrange by !is.na() on the column: FALSE sorts before TRUE, so the rows where the value is missing come first.
A working example:
arrange(flights, !is.na(dep_time))
## # A tibble: 336,776 x 19
## year month day dep_time sched_dep_time dep_delay arr_time
## <int> <int> <int> <int> <int> <dbl> <int>
## 1 2013 1 1 NA 1630 NA NA
## 2 2013 1 1 NA 1935 NA NA
## 3 2013 1 1 NA 1500 NA NA
## 4 2013 1 1 NA 600 NA NA
## 5 2013 1 2 NA 1540 NA NA
## 6 2013 1 2 NA 1620 NA NA
## 7 2013 1 2 NA 1355 NA NA
## 8 2013 1 2 NA 1420 NA NA
## 9 2013 1 2 NA 1321 NA NA
## 10 2013 1 2 NA 1545 NA NA
## # ... with 336,766 more rows, and 12 more variables: sched_arr_time <int>,
## # arr_delay <dbl>, carrier <chr>, flight <int>, tailnum <chr>,
## # origin <chr>, dest <chr>, air_time <dbl>, distance <dbl>, hour <dbl>,
## # minute <dbl>, time_hour <dttm>
Sort flights to find the most delayed flights. Find the flights that left earliest.
# Based on arrival delay
arrange(flights, desc(arr_delay))
## # A tibble: 336,776 x 19
## year month day dep_time sched_dep_time dep_delay arr_time
## <int> <int> <int> <int> <int> <dbl> <int>
## 1 2013 1 9 641 900 1301 1242
## 2 2013 6 15 1432 1935 1137 1607
## 3 2013 1 10 1121 1635 1126 1239
## 4 2013 9 20 1139 1845 1014 1457
## 5 2013 7 22 845 1600 1005 1044
## 6 2013 4 10 1100 1900 960 1342
## 7 2013 3 17 2321 810 911 135
## 8 2013 7 22 2257 759 898 121
## 9 2013 12 5 756 1700 896 1058
## 10 2013 5 3 1133 2055 878 1250
## # ... with 336,766 more rows, and 12 more variables: sched_arr_time <int>,
## # arr_delay <dbl>, carrier <chr>, flight <int>, tailnum <chr>,
## # origin <chr>, dest <chr>, air_time <dbl>, distance <dbl>, hour <dbl>,
## # minute <dbl>, time_hour <dttm>
# Flights that left earliest
arrange(flights, dep_delay)
## # A tibble: 336,776 x 19
## year month day dep_time sched_dep_time dep_delay arr_time
## <int> <int> <int> <int> <int> <dbl> <int>
## 1 2013 12 7 2040 2123 -43 40
## 2 2013 2 3 2022 2055 -33 2240
## 3 2013 11 10 1408 1440 -32 1549
## 4 2013 1 11 1900 1930 -30 2233
## 5 2013 1 29 1703 1730 -27 1947
## 6 2013 8 9 729 755 -26 1002
## 7 2013 10 23 1907 1932 -25 2143
## 8 2013 3 30 2030 2055 -25 2213
## 9 2013 3 2 1431 1455 -24 1601
## 10 2013 5 5 934 958 -24 1225
## # ... with 336,766 more rows, and 12 more variables: sched_arr_time <int>,
## # arr_delay <dbl>, carrier <chr>, flight <int>, tailnum <chr>,
## # origin <chr>, dest <chr>, air_time <dbl>, distance <dbl>, hour <dbl>,
## # minute <dbl>, time_hour <dttm>
Sort flights to find the fastest flights.
# Based on speed (distance / air_time)
arrange(flights, desc(distance/air_time))
## # A tibble: 336,776 x 19
## year month day dep_time sched_dep_time dep_delay arr_time
## <int> <int> <int> <int> <int> <dbl> <int>
## 1 2013 5 25 1709 1700 9 1923
## 2 2013 7 2 1558 1513 45 1745
## 3 2013 5 13 2040 2025 15 2225
## 4 2013 3 23 1914 1910 4 2045
## 5 2013 1 12 1559 1600 -1 1849
## 6 2013 11 17 650 655 -5 1059
## 7 2013 2 21 2355 2358 -3 412
## 8 2013 11 17 759 800 -1 1212
## 9 2013 11 16 2003 1925 38 17
## 10 2013 11 16 2349 2359 -10 402
## # ... with 336,766 more rows, and 12 more variables: sched_arr_time <int>,
## # arr_delay <dbl>, carrier <chr>, flight <int>, tailnum <chr>,
## # origin <chr>, dest <chr>, air_time <dbl>, distance <dbl>, hour <dbl>,
## # minute <dbl>, time_hour <dttm>
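Since distance is in miles and air_time in minutes, distance / air_time is miles per minute; multiplying by 60 gives the more familiar miles per hour. A minimal check using the values from the first two rows of the mutate() example later on:

```r
distance <- c(1400, 1089)  # miles
air_time <- c(227, 160)    # minutes

speed_mph <- distance / air_time * 60
round(speed_mph)  # 370 408
```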
Which flights travelled the longest? Which travelled the shortest?
# Longest
arrange(flights, desc(distance))
## # A tibble: 336,776 x 19
## year month day dep_time sched_dep_time dep_delay arr_time
## <int> <int> <int> <int> <int> <dbl> <int>
## 1 2013 1 1 857 900 -3 1516
## 2 2013 1 2 909 900 9 1525
## 3 2013 1 3 914 900 14 1504
## 4 2013 1 4 900 900 0 1516
## 5 2013 1 5 858 900 -2 1519
## 6 2013 1 6 1019 900 79 1558
## 7 2013 1 7 1042 900 102 1620
## 8 2013 1 8 901 900 1 1504
## 9 2013 1 9 641 900 1301 1242
## 10 2013 1 10 859 900 -1 1449
## # ... with 336,766 more rows, and 12 more variables: sched_arr_time <int>,
## # arr_delay <dbl>, carrier <chr>, flight <int>, tailnum <chr>,
## # origin <chr>, dest <chr>, air_time <dbl>, distance <dbl>, hour <dbl>,
## # minute <dbl>, time_hour <dttm>
# Shortest
arrange(flights, distance)
## # A tibble: 336,776 x 19
## year month day dep_time sched_dep_time dep_delay arr_time
## <int> <int> <int> <int> <int> <dbl> <int>
## 1 2013 7 27 NA 106 NA NA
## 2 2013 1 3 2127 2129 -2 2222
## 3 2013 1 4 1240 1200 40 1333
## 4 2013 1 4 1829 1615 134 1937
## 5 2013 1 4 2128 2129 -1 2218
## 6 2013 1 5 1155 1200 -5 1241
## 7 2013 1 6 2125 2129 -4 2224
## 8 2013 1 7 2124 2129 -5 2212
## 9 2013 1 8 2127 2130 -3 2304
## 10 2013 1 9 2126 2129 -3 2217
## # ... with 336,766 more rows, and 12 more variables: sched_arr_time <int>,
## # arr_delay <dbl>, carrier <chr>, flight <int>, tailnum <chr>,
## # origin <chr>, dest <chr>, air_time <dbl>, distance <dbl>, hour <dbl>,
## # minute <dbl>, time_hour <dttm>
select()
select() allows you to rapidly zoom in on a useful subset using operations based on the names of variables.
# Select columns by name
select(flights, year, month, day)
## # A tibble: 336,776 x 3
## year month day
## <int> <int> <int>
## 1 2013 1 1
## 2 2013 1 1
## 3 2013 1 1
## 4 2013 1 1
## 5 2013 1 1
## 6 2013 1 1
## 7 2013 1 1
## 8 2013 1 1
## 9 2013 1 1
## 10 2013 1 1
## # ... with 336,766 more rows
# Select all columns between year and day
select(flights, year:day)
## # A tibble: 336,776 x 3
## year month day
## <int> <int> <int>
## 1 2013 1 1
## 2 2013 1 1
## 3 2013 1 1
## 4 2013 1 1
## 5 2013 1 1
## 6 2013 1 1
## 7 2013 1 1
## 8 2013 1 1
## 9 2013 1 1
## 10 2013 1 1
## # ... with 336,766 more rows
# Select all columns except those from year to day (inclusive)
select(flights, -(year:day))
## # A tibble: 336,776 x 16
## dep_time sched_dep_time dep_delay arr_time sched_arr_time arr_delay
## <int> <int> <dbl> <int> <int> <dbl>
## 1 517 515 2 830 819 11
## 2 533 529 4 850 830 20
## 3 542 540 2 923 850 33
## 4 544 545 -1 1004 1022 -18
## 5 554 600 -6 812 837 -25
## 6 554 558 -4 740 728 12
## 7 555 600 -5 913 854 19
## 8 557 600 -3 709 723 -14
## 9 557 600 -3 838 846 -8
## 10 558 600 -2 753 745 8
## # ... with 336,766 more rows, and 10 more variables: carrier <chr>,
## # flight <int>, tailnum <chr>, origin <chr>, dest <chr>, air_time <dbl>,
## # distance <dbl>, hour <dbl>, minute <dbl>, time_hour <dttm>
There are a number of helper functions you can use within select():
- starts_with("abc") matches names that begin with "abc".
- ends_with("xyz") matches names that end with "xyz".
- contains("ijk") matches names that contain "ijk".
- matches("(.)\\1") selects variables that match a regular expression. This one matches any variables that contain repeated characters. More in strings.
- num_range("x", 1:3) matches x1, x2 and x3.
Use everything() if there's a handful of variables that you want to move to the start of the data frame.
select(flights, time_hour, air_time, everything())
## # A tibble: 336,776 x 19
## time_hour air_time year month day dep_time sched_dep_time
## <dttm> <dbl> <int> <int> <int> <int> <int>
## 1 2013-01-01 05:00:00 227 2013 1 1 517 515
## 2 2013-01-01 05:00:00 227 2013 1 1 533 529
## 3 2013-01-01 05:00:00 160 2013 1 1 542 540
## 4 2013-01-01 05:00:00 183 2013 1 1 544 545
## 5 2013-01-01 06:00:00 116 2013 1 1 554 600
## 6 2013-01-01 05:00:00 150 2013 1 1 554 558
## 7 2013-01-01 06:00:00 158 2013 1 1 555 600
## 8 2013-01-01 06:00:00 53 2013 1 1 557 600
## 9 2013-01-01 06:00:00 140 2013 1 1 557 600
## 10 2013-01-01 06:00:00 138 2013 1 1 558 600
## # ... with 336,766 more rows, and 12 more variables: dep_delay <dbl>,
## # arr_time <int>, sched_arr_time <int>, arr_delay <dbl>, carrier <chr>,
## # flight <int>, tailnum <chr>, origin <chr>, dest <chr>, distance <dbl>,
## # hour <dbl>, minute <dbl>
Brainstorm as many ways as possible to select dep_time, dep_delay, arr_time, and arr_delay from flights.
select(flights, dep_time, dep_delay, arr_time, arr_delay)
## # A tibble: 336,776 x 4
## dep_time dep_delay arr_time arr_delay
## <int> <dbl> <int> <dbl>
## 1 517 2 830 11
## 2 533 4 850 20
## 3 542 2 923 33
## 4 544 -1 1004 -18
## 5 554 -6 812 -25
## 6 554 -4 740 12
## 7 555 -5 913 19
## 8 557 -3 709 -14
## 9 557 -3 838 -8
## 10 558 -2 753 8
## # ... with 336,766 more rows
select(flights, starts_with("dep"), starts_with("arr"))
## # A tibble: 336,776 x 4
## dep_time dep_delay arr_time arr_delay
## <int> <dbl> <int> <dbl>
## 1 517 2 830 11
## 2 533 4 850 20
## 3 542 2 923 33
## 4 544 -1 1004 -18
## 5 554 -6 812 -25
## 6 554 -4 740 12
## 7 555 -5 913 19
## 8 557 -3 709 -14
## 9 557 -3 838 -8
## 10 558 -2 753 8
## # ... with 336,766 more rows
select(flights, ends_with("time"), ends_with("delay"))
## # A tibble: 336,776 x 7
## dep_time sched_dep_time arr_time sched_arr_time air_time dep_delay
## <int> <int> <int> <int> <dbl> <dbl>
## 1 517 515 830 819 227 2
## 2 533 529 850 830 227 4
## 3 542 540 923 850 160 2
## 4 544 545 1004 1022 183 -1
## 5 554 600 812 837 116 -6
## 6 554 558 740 728 150 -4
## 7 555 600 913 854 158 -5
## 8 557 600 709 723 53 -3
## 9 557 600 838 846 140 -3
## 10 558 600 753 745 138 -2
## # ... with 336,766 more rows, and 1 more variables: arr_delay <dbl>
select(flights, contains("delay"))
## # A tibble: 336,776 x 2
## dep_delay arr_delay
## <dbl> <dbl>
## 1 2 11
## 2 4 20
## 3 2 33
## 4 -1 -18
## 5 -6 -25
## 6 -4 12
## 7 -5 19
## 8 -3 -14
## 9 -3 -8
## 10 -2 8
## # ... with 336,766 more rows
What does the one_of() function do? Why might it be useful in conjunction with this vector?
vars <- c("year", "month", "day", "dep_delay", "arr_delay")
# one_of() selects any variable which matches one of the strings in the vector
select(flights, one_of(vars))
## # A tibble: 336,776 x 5
## year month day dep_delay arr_delay
## <int> <int> <int> <dbl> <dbl>
## 1 2013 1 1 2 11
## 2 2013 1 1 4 20
## 3 2013 1 1 2 33
## 4 2013 1 1 -1 -18
## 5 2013 1 1 -6 -25
## 6 2013 1 1 -4 12
## 7 2013 1 1 -5 19
## 8 2013 1 1 -3 -14
## 9 2013 1 1 -3 -8
## 10 2013 1 1 -2 8
## # ... with 336,766 more rows
Does the result of running the following code surprise you? How do the select helpers deal with case by default? How can you change that default?
select(flights, contains("TIME"))
## # A tibble: 336,776 x 6
## dep_time sched_dep_time arr_time sched_arr_time air_time
## <int> <int> <int> <int> <dbl>
## 1 517 515 830 819 227
## 2 533 529 850 830 227
## 3 542 540 923 850 160
## 4 544 545 1004 1022 183
## 5 554 600 812 837 116
## 6 554 558 740 728 150
## 7 555 600 913 854 158
## 8 557 600 709 723 53
## 9 557 600 838 846 140
## 10 558 600 753 745 138
## # ... with 336,766 more rows, and 1 more variables: time_hour <dttm>
By default the select helpers ignore case. To make them case-sensitive, set ignore.case = FALSE in the helper function. For example:
select(flights, contains("TIME", ignore.case = FALSE))
## # A tibble: 336,776 x 0
mutate()
mutate() always adds new columns at the end of your dataset, so we'll start by creating a narrower dataset so we can see the new variables. Remember that when you're in RStudio, the easiest way to see all the columns is View().
flights_sml <- select(flights,
year:day,
ends_with("delay"),
distance,
air_time)
mutate(flights_sml,
gain = arr_delay - dep_delay,
speed = distance / air_time * 60)
## # A tibble: 336,776 x 9
## year month day dep_delay arr_delay distance air_time gain speed
## <int> <int> <int> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 2013 1 1 2 11 1400 227 9 370.0441
## 2 2013 1 1 4 20 1416 227 16 374.2731
## 3 2013 1 1 2 33 1089 160 31 408.3750
## 4 2013 1 1 -1 -18 1576 183 -17 516.7213
## 5 2013 1 1 -6 -25 762 116 -19 394.1379
## 6 2013 1 1 -4 12 719 150 16 287.6000
## 7 2013 1 1 -5 19 1065 158 24 404.4304
## 8 2013 1 1 -3 -14 229 53 -11 259.2453
## 9 2013 1 1 -3 -8 944 140 -5 404.5714
## 10 2013 1 1 -2 8 733 138 10 318.6957
## # ... with 336,766 more rows
Note that you can refer to columns that you’ve just created:
mutate(flights_sml,
gain = arr_delay - dep_delay,
hours = air_time / 60,
gain_per_hour = gain / hours)
## # A tibble: 336,776 x 10
## year month day dep_delay arr_delay distance air_time gain hours
## <int> <int> <int> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 2013 1 1 2 11 1400 227 9 3.7833333
## 2 2013 1 1 4 20 1416 227 16 3.7833333
## 3 2013 1 1 2 33 1089 160 31 2.6666667
## 4 2013 1 1 -1 -18 1576 183 -17 3.0500000
## 5 2013 1 1 -6 -25 762 116 -19 1.9333333
## 6 2013 1 1 -4 12 719 150 16 2.5000000
## 7 2013 1 1 -5 19 1065 158 24 2.6333333
## 8 2013 1 1 -3 -14 229 53 -11 0.8833333
## 9 2013 1 1 -3 -8 944 140 -5 2.3333333
## 10 2013 1 1 -2 8 733 138 10 2.3000000
## # ... with 336,766 more rows, and 1 more variables: gain_per_hour <dbl>
If you only want to keep the new variables, use transmute():
transmute(flights,
gain = arr_delay - dep_delay,
hours = air_time / 60,
gain_per_hour = gain / hours)
## # A tibble: 336,776 x 3
## gain hours gain_per_hour
## <dbl> <dbl> <dbl>
## 1 9 3.7833333 2.378855
## 2 16 3.7833333 4.229075
## 3 31 2.6666667 11.625000
## 4 -17 3.0500000 -5.573770
## 5 -19 1.9333333 -9.827586
## 6 16 2.5000000 6.400000
## 7 24 2.6333333 9.113924
## 8 -11 0.8833333 -12.452830
## 9 -5 2.3333333 -2.142857
## 10 10 2.3000000 4.347826
## # ... with 336,766 more rows
Useful creation functions:
- Arithmetic operators: +, -, *, /, ^. x / sum(x) calculates the proportion of a total, and y - mean(y) computes the difference from the mean.
- Modular arithmetic: %/% (integer division) and %% (remainder), where x == y * (x %/% y) + (x %% y). These let you break integers into pieces:
transmute(flights,
dep_time,
hour = dep_time %/% 100,
minute = dep_time %% 100)
## # A tibble: 336,776 x 3
## dep_time hour minute
## <int> <dbl> <dbl>
## 1 517 5 17
## 2 533 5 33
## 3 542 5 42
## 4 544 5 44
## 5 554 5 54
## 6 554 5 54
## 7 555 5 55
## 8 557 5 57
## 9 557 5 57
## 10 558 5 58
## # ... with 336,766 more rows
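The identity x == y * (x %/% y) + (x %% y) is what makes this decomposition lossless: hour and minute can always be recombined into the original HHMM value. A base-R check:

```r
dep_time <- c(517, 2356)

hour   <- dep_time %/% 100  # integer division drops the minutes
minute <- dep_time %% 100   # remainder keeps only the minutes

# recombining reproduces the original times
stopifnot(all(dep_time == hour * 100 + minute))
```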
- Logs: log(), log2(), log10(). In log2() a difference of 1 on the log scale corresponds to doubling on the original scale, and a difference of -1 corresponds to halving.
- Offsets: lead() and lag().
(x <- 1:10)
## [1] 1 2 3 4 5 6 7 8 9 10
lag(x)
## [1] NA 1 2 3 4 5 6 7 8 9
lead(x)
## [1] 2 3 4 5 6 7 8 9 10 NA
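lag() is handy for running differences such as x - lag(x). A base-R sketch of what lag() does (the helper name lag1 is made up here):

```r
# shift right by one and pad with NA, like dplyr's lag()
lag1 <- function(x) c(NA, x[-length(x)])

x <- c(10, 12, 15, 11)
x - lag1(x)  # NA  2  3 -4  (change from the previous value)
```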
- Cumulative and rolling aggregates: cumsum(), cumprod(), cummin(), cummax(); dplyr also provides cummean() for cumulative means.
x
## [1] 1 2 3 4 5 6 7 8 9 10
cumsum(x)
## [1] 1 3 6 10 15 21 28 36 45 55
cummean(x)
## [1] 1.0 1.5 2.0 2.5 3.0 3.5 4.0 4.5 5.0 5.5
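cummean() itself is just a cumulative sum divided by the position, which is easy to verify in base R:

```r
x <- 1:10

# cumulative mean without dplyr: running sum / running count
cumsum(x) / seq_along(x)  # 1.0 1.5 2.0 ... 5.5
```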
- Ranking: there are a number of ranking functions, starting with min_rank().
y <- c(1, 2, 2, NA, 3, 4)
min_rank(y)
## [1] 1 2 2 NA 4 5
min_rank(desc(y))
## [1] 5 3 3 NA 2 1
If min_rank() doesn’t do what you need, look at the variants:
row_number(y)
## [1] 1 2 3 NA 4 5
dense_rank(y)
## [1] 1 2 2 NA 3 4
percent_rank(y)
## [1] 0.00 0.25 0.25 NA 0.75 1.00
cume_dist(y)
## [1] 0.2 0.6 0.6 NA 0.8 1.0
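These ranking functions map onto base R's rank() with different ties.method settings; for example (using na.last = "keep" so missing values stay NA):

```r
y <- c(1, 2, 2, NA, 3, 4)

rank(y, ties.method = "min",   na.last = "keep")  # min_rank():   1 2 2 NA 4 5
rank(y, ties.method = "first", na.last = "keep")  # row_number(): 1 2 3 NA 4 5
```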
dep_time and sched_dep_time are convenient to look at, but hard to compute with because they're not really continuous numbers. Convert them to a more convenient representation of number of minutes since midnight.
transmute(flights,
dep_time = (dep_time %/% 100) * 60 + dep_time %% 100,
sched_dep_time = (sched_dep_time %/% 100) * 60 + sched_dep_time %% 100)
## # A tibble: 336,776 x 2
## dep_time sched_dep_time
## <dbl> <dbl>
## 1 317 315
## 2 333 329
## 3 342 340
## 4 344 345
## 5 354 360
## 6 354 358
## 7 355 360
## 8 357 360
## 9 357 360
## 10 358 360
## # ... with 336,766 more rows
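The conversion can be wrapped into a small helper (the name time_to_mins is made up here). Taking the result modulo 1440, the number of minutes in a day, additionally maps midnight, stored as 2400, to 0:

```r
# HHMM clock time -> minutes since midnight
time_to_mins <- function(t) ((t %/% 100) * 60 + t %% 100) %% 1440

time_to_mins(c(517, 600, 2400))  # 317 360 0
```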
Compare air_time with arr_time - dep_time. What do you expect to see? What do you see? What do you need to do to fix it?
transmute(flights,
air_time,
arr_time,
dep_time,
air_time_new = arr_time - dep_time)
## # A tibble: 336,776 x 4
## air_time arr_time dep_time air_time_new
## <dbl> <int> <int> <int>
## 1 227 830 517 313
## 2 227 850 533 317
## 3 160 923 542 381
## 4 183 1004 544 460
## 5 116 812 554 258
## 6 150 740 554 186
## 7 158 913 555 358
## 8 53 709 557 152
## 9 140 838 557 281
## 10 138 753 558 195
## # ... with 336,766 more rows
They are not the same because dep_time and arr_time are not measured in minutes; they are numerical representations of the clock time. We need to convert them to continuous numbers, as above, to make the correct calculation for air_time.
transmute(flights,
air_time,
arr_time = (arr_time %/% 100) * 60 + arr_time %% 100,
dep_time = (dep_time %/% 100) * 60 + dep_time %% 100,
air_time_new = arr_time - dep_time)
## # A tibble: 336,776 x 4
## air_time arr_time dep_time air_time_new
## <dbl> <dbl> <dbl> <dbl>
## 1 227 510 317 193
## 2 227 530 333 197
## 3 160 563 342 221
## 4 183 604 344 260
## 5 116 492 354 138
## 6 150 460 354 106
## 7 158 553 355 198
## 8 53 429 357 72
## 9 140 518 357 161
## 10 138 473 358 115
## # ... with 336,766 more rows
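One remaining wrinkle (not visible in the rows shown above) is overnight flights: when arrival happens after midnight, arr_time - dep_time goes negative even after the conversion to minutes. Taking the difference modulo 1440, the number of minutes in a day, handles that case:

```r
dep <- 23 * 60 + 30  # departs 23:30
arr <-  1 * 60 + 10  # arrives 01:10 the next day

(arr - dep) %% 1440  # 100 minutes gate to gate
```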
Compare dep_time, sched_dep_time, and dep_delay. How would you expect those three numbers to be related?
I expect that dep_delay = dep_time - sched_dep_time.
transmute(flights,
dep_time,
sched_dep_time,
dep_delay,
dep_delay_new = dep_time - sched_dep_time)
## # A tibble: 336,776 x 4
## dep_time sched_dep_time dep_delay dep_delay_new
## <int> <int> <dbl> <int>
## 1 517 515 2 2
## 2 533 529 4 4
## 3 542 540 2 2
## 4 544 545 -1 -1
## 5 554 600 -6 -46
## 6 554 558 -4 -4
## 7 555 600 -5 -45
## 8 557 600 -3 -43
## 9 557 600 -3 -43
## 10 558 600 -2 -42
## # ... with 336,766 more rows
The exceptions (rows 5 and 7-10) occur when the actual and scheduled departures fall in different hours: HHMM values jump by 40 at each hour boundary (from x:59 to (x+1):00), so the raw subtraction is off by 40. Converting both times to minutes since midnight first removes the discrepancy.
Find the 10 most delayed flights using a ranking function. How do you want to handle ties? Carefully read the documentation for min_rank().
delayed <- mutate(flights, most_delayed = min_rank(desc(arr_delay)))
select(delayed, flight, most_delayed) %>%
arrange(most_delayed)
## # A tibble: 336,776 x 2
## flight most_delayed
## <int> <int>
## 1 51 1
## 2 3535 2
## 3 3695 3
## 4 177 4
## 5 3075 5
## 6 2391 6
## 7 2119 7
## 8 2047 8
## 9 172 9
## 10 3744 10
## # ... with 336,766 more rows
What does 1:3 + 1:10 return? Why?
1:3 + 1:10
## Warning in 1:3 + 1:10: longer object length is not a multiple of shorter
## object length
## [1] 2 4 6 5 7 9 8 10 12 11
Because the two vectors are not the same length, R *recycles* the shorter one until the two lengths match. Then R adds the first elements together, then the second elements, and so on. Since 10 is not a multiple of 3, R also emits a warning.
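The warning is only about the lengths not lining up evenly; recycling itself is silent when the shorter length divides the longer one:

```r
# no warning here: length 2 divides length 10
1:2 + 1:10  # 2 4 4 6 6 8 8 10 10 12
```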
What trigonometric functions does R provide? Cosine, sine, tangent, arc-cosine, arc-sine, arc-tangent, and the two-argument arc-tangent: cos(), sin(), tan(), acos(), asin(), atan(), and atan2().
summarise()
summarise() is the last key verb. It collapses a data frame to a single row:
summarise(flights, delay = mean(dep_delay, na.rm = TRUE))
## # A tibble: 1 x 1
## delay
## <dbl>
## 1 12.63907
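The na.rm = TRUE matters because a single NA otherwise propagates through the aggregate; a quick illustration in base R:

```r
delays <- c(2, 4, NA, -1)

mean(delays)               # NA: the missing value poisons the result
mean(delays, na.rm = TRUE) # 5/3, the mean of the observed values
```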
summarise() is not very useful unless we pair it with group_by(). This changes the unit of analysis from the complete dataset to individual groups.
by_day <- group_by(flights, year, month, day)
summarise(by_day, delay = mean(dep_delay, na.rm = TRUE))
## # A tibble: 365 x 4
## # Groups: year, month [?]
## year month day delay
## <int> <int> <int> <dbl>
## 1 2013 1 1 11.548926
## 2 2013 1 2 13.858824
## 3 2013 1 3 10.987832
## 4 2013 1 4 8.951595
## 5 2013 1 5 5.732218
## 6 2013 1 6 7.148014
## 7 2013 1 7 5.417204
## 8 2013 1 8 2.553073
## 9 2013 1 9 2.276477
## 10 2013 1 10 2.844995
## # ... with 355 more rows
by_dest <- group_by(flights, dest)
delay <- summarise(by_dest,
count = n(),
dist = mean(distance, na.rm = TRUE),
delay = mean(arr_delay, na.rm = TRUE))
delay <- filter(delay, count > 20, dest != "HNL")
# It looks like delays increase with distance up to ~750 miles
# and then decrease. Maybe as flights get longer there's more
# ability to make up delays in the air
ggplot(data = delay, mapping = aes(x = dist, y = delay)) +
geom_point(aes(size = count), alpha = 1/3) +
geom_smooth(se = FALSE)
## `geom_smooth()` using method = 'loess'
A simpler way to tackle this:
delays <- flights %>%
group_by(dest) %>%
summarise(
count = n(),
dist = mean(distance, na.rm = TRUE),
delay = mean(arr_delay, na.rm = TRUE)) %>%
filter(count > 20, dest != "HNL")
Behind the scenes, x %>% f(y) turns into f(x, y), and x %>% f(y) %>% g(z) turns into g(f(x, y), z).
Missing values: removing them before aggregating often matters. For example, by removing the cancelled flights we obtain a more accurate mean delay per day.
not_cancelled <- flights %>%
filter(!is.na(dep_delay), !is.na(arr_delay))
not_cancelled %>%
group_by(year, month, day) %>%
summarise(mean = mean(dep_delay))
## # A tibble: 365 x 4
## # Groups: year, month [?]
## year month day mean
## <int> <int> <int> <dbl>
## 1 2013 1 1 11.435620
## 2 2013 1 2 13.677802
## 3 2013 1 3 10.907778
## 4 2013 1 4 8.965859
## 5 2013 1 5 5.732218
## 6 2013 1 6 7.145959
## 7 2013 1 7 5.417204
## 8 2013 1 8 2.558296
## 9 2013 1 9 2.301232
## 10 2013 1 10 2.844995
## # ... with 355 more rows
Whenever you do any aggregation, it's always a good idea to include either a count (n()) or a count of non-missing values (sum(!is.na(x))). That way you can check that you're not drawing conclusions based on very small amounts of data. For example, let's look at the planes (identified by tail number) that have the highest average delays:
delays <- not_cancelled %>%
group_by(tailnum) %>%
summarise(
delay = mean(arr_delay, na.rm = TRUE),
n = n()
)
ggplot(data = delays, mapping = aes(x = n, y = delay)) +
geom_point(alpha = 1/10)
Can put everything together by using a combination of %>% and +.
delays %>%
filter(n > 25) %>%
ggplot(mapping = aes(x = n, y = delay)) +
geom_point(alpha = 1/10)
Useful summary functions:
- Measures of location: mean(x) and median(x).
not_cancelled %>%
group_by(year, month, day) %>%
summarise(
avg_delay1 = mean(arr_delay),
avg_delay2 = mean(arr_delay[arr_delay > 0]) # the average positive delay
)
## # A tibble: 365 x 5
## # Groups: year, month [?]
## year month day avg_delay1 avg_delay2
## <int> <int> <int> <dbl> <dbl>
## 1 2013 1 1 12.6510229 32.48156
## 2 2013 1 2 12.6928879 32.02991
## 3 2013 1 3 5.7333333 27.66087
## 4 2013 1 4 -1.9328194 28.30976
## 5 2013 1 5 -1.5258020 22.55882
## 6 2013 1 6 4.2364294 24.37270
## 7 2013 1 7 -4.9473118 27.76132
## 8 2013 1 8 -3.2275785 20.78909
## 9 2013 1 9 -0.2642777 25.63415
## 10 2013 1 10 -5.8988159 27.34545
## # ... with 355 more rows
- Measures of spread: sd(x), IQR(x), mad(x).
# Why is distance to some destinations more variable than to others?
not_cancelled %>%
group_by(dest) %>%
summarise(distance_sd = sd(distance)) %>%
arrange(desc(distance_sd))
## # A tibble: 104 x 2
## dest distance_sd
## <chr> <dbl>
## 1 EGE 10.542765
## 2 SAN 10.350094
## 3 SFO 10.216017
## 4 HNL 10.004197
## 5 SEA 9.977993
## 6 LAS 9.907786
## 7 PDX 9.873299
## 8 PHX 9.862546
## 9 LAX 9.657195
## 10 IND 9.458066
## # ... with 94 more rows
- Measures of rank: min(x), quantile(x, 0.25), max(x). quantile(x, 0.25) finds a value of x that is greater than 25% of the values and less than the remaining 75%.
# When do the first and last flights leave each day?
not_cancelled %>%
group_by(year, month, day) %>%
summarise(
first = min(dep_time),
last = max(dep_time)
)
## # A tibble: 365 x 5
## # Groups: year, month [?]
## year month day first last
## <int> <int> <int> <dbl> <dbl>
## 1 2013 1 1 517 2356
## 2 2013 1 2 42 2354
## 3 2013 1 3 32 2349
## 4 2013 1 4 25 2358
## 5 2013 1 5 14 2357
## 6 2013 1 6 16 2355
## 7 2013 1 7 49 2359
## 8 2013 1 8 454 2351
## 9 2013 1 9 2 2252
## 10 2013 1 10 3 2320
## # ... with 355 more rows
- Measures of position: first(x), nth(x, 2), last(x). These work similarly to x[1], x[2], and x[length(x)], but let you set a default value if that position does not exist.
not_cancelled %>%
group_by(year, month, day) %>%
summarise(
first_dep = first(dep_time),
last_dep = last(dep_time)
)
## # A tibble: 365 x 5
## # Groups: year, month [?]
## year month day first_dep last_dep
## <int> <int> <int> <int> <int>
## 1 2013 1 1 517 2356
## 2 2013 1 2 42 2354
## 3 2013 1 3 32 2349
## 4 2013 1 4 25 2358
## 5 2013 1 5 14 2357
## 6 2013 1 6 16 2355
## 7 2013 1 7 49 2359
## 8 2013 1 8 454 2351
## 9 2013 1 9 2 2252
## 10 2013 1 10 3 2320
## # ... with 355 more rows
# These functions are complementary to filtering on ranks.
# Filtering gives you all variables, with each obs in a separate row
not_cancelled %>%
group_by(year, month, day) %>%
mutate(r = min_rank(desc(dep_time))) %>%
filter(r %in% range(r))
## # A tibble: 770 x 20
## # Groups: year, month, day [365]
## year month day dep_time sched_dep_time dep_delay arr_time
## <int> <int> <int> <int> <int> <dbl> <int>
## 1 2013 1 1 517 515 2 830
## 2 2013 1 1 2356 2359 -3 425
## 3 2013 1 2 42 2359 43 518
## 4 2013 1 2 2354 2359 -5 413
## 5 2013 1 3 32 2359 33 504
## 6 2013 1 3 2349 2359 -10 434
## 7 2013 1 4 25 2359 26 505
## 8 2013 1 4 2358 2359 -1 429
## 9 2013 1 4 2358 2359 -1 436
## 10 2013 1 5 14 2359 15 503
## # ... with 760 more rows, and 13 more variables: sched_arr_time <int>,
## # arr_delay <dbl>, carrier <chr>, flight <int>, tailnum <chr>,
## # origin <chr>, dest <chr>, air_time <dbl>, distance <dbl>, hour <dbl>,
## # minute <dbl>, time_hour <dttm>, r <int>
- Counts: n() (the size of the current group), sum(!is.na(x)) (the number of non-missing values), and n_distinct(x) (the number of unique values).
# Which destinations have the most carriers?
not_cancelled %>%
group_by(dest) %>%
summarise(carriers = n_distinct(carrier)) %>%
arrange(desc(carriers))
## # A tibble: 104 x 2
## dest carriers
## <chr> <int>
## 1 ATL 7
## 2 BOS 7
## 3 CLT 7
## 4 ORD 7
## 5 TPA 7
## 6 AUS 6
## 7 DCA 6
## 8 DTW 6
## 9 IAD 6
## 10 MSP 6
## # ... with 94 more rows
- Counts and proportions of logical values: sum(x > 10), mean(y == 0).
# How many flights left before 5am? (these usually indicate delayed
# flights from the previous day)
not_cancelled %>%
group_by(year, month, day) %>%
summarise(n_early = sum(dep_time < 500))
## # A tibble: 365 x 4
## # Groups: year, month [?]
## year month day n_early
## <int> <int> <int> <int>
## 1 2013 1 1 0
## 2 2013 1 2 3
## 3 2013 1 3 4
## 4 2013 1 4 3
## 5 2013 1 5 3
## 6 2013 1 6 2
## 7 2013 1 7 2
## 8 2013 1 8 1
## 9 2013 1 9 3
## 10 2013 1 10 3
## # ... with 355 more rows
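These shortcuts work because sum() and mean() coerce logical vectors to 0/1, so the sum is a count of TRUEs and the mean is a proportion:

```r
dep_time <- c(453, 510, 600, 455)

sum(dep_time < 500)   # 2: number of departures before 5am
mean(dep_time < 500)  # 0.5: proportion of departures before 5am
```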
When you group by multiple variables, each summarise() peels off one level of the grouping:
daily <- group_by(flights, year, month, day)
(per_day <- summarise(daily, flights = n()))
## # A tibble: 365 x 4
## # Groups: year, month [?]
## year month day flights
## <int> <int> <int> <int>
## 1 2013 1 1 842
## 2 2013 1 2 943
## 3 2013 1 3 914
## 4 2013 1 4 915
## 5 2013 1 5 720
## 6 2013 1 6 832
## 7 2013 1 7 933
## 8 2013 1 8 899
## 9 2013 1 9 902
## 10 2013 1 10 932
## # ... with 355 more rows
(per_month <- summarise(per_day, flights = sum(flights)))
## # A tibble: 12 x 3
## # Groups: year [?]
## year month flights
## <int> <int> <int>
## 1 2013 1 27004
## 2 2013 2 24951
## 3 2013 3 28834
## 4 2013 4 28330
## 5 2013 5 28796
## 6 2013 6 28243
## 7 2013 7 29425
## 8 2013 8 29327
## 9 2013 9 27574
## 10 2013 10 28889
## 11 2013 11 27268
## 12 2013 12 28135
(per_year <- summarise(per_month, flights = sum(flights)))
## # A tibble: 1 x 2
## year flights
## <int> <int>
## 1 2013 336776
If you need to remove grouping, use ungroup():
daily %>%
ungroup() %>% # no longer grouped by date
summarise(flights = n()) # all flights
## # A tibble: 1 x 1
## flights
## <int>
## 1 336776
Brainstorm ways to assess the typical delay characteristics of a group of flights. Consider the following scenarios:
# A flight is 15 minutes early 50% of the time, and 15 minutes late 50% of the time.
flights %>%
group_by(flight) %>%
summarise(early_15_min = sum(arr_delay <= -15, na.rm = TRUE) / n(),
late_15_min = sum(arr_delay >= 15, na.rm = TRUE) / n()) %>%
filter(early_15_min == 0.5,
late_15_min == 0.5)
## # A tibble: 18 x 3
## flight early_15_min late_15_min
## <int> <dbl> <dbl>
## 1 107 0.5 0.5
## 2 2072 0.5 0.5
## 3 2366 0.5 0.5
## 4 2500 0.5 0.5
## 5 2552 0.5 0.5
## 6 3495 0.5 0.5
## 7 3518 0.5 0.5
## 8 3544 0.5 0.5
## 9 3651 0.5 0.5
## 10 3705 0.5 0.5
## 11 3916 0.5 0.5
## 12 3951 0.5 0.5
## 13 4273 0.5 0.5
## 14 4313 0.5 0.5
## 15 5297 0.5 0.5
## 16 5322 0.5 0.5
## 17 5388 0.5 0.5
## 18 5505 0.5 0.5
# A flight is always 10 minutes late.
flights %>%
group_by(flight) %>%
summarise(late_10 = sum(arr_delay == 10, na.rm = TRUE) / n()) %>%
filter(late_10 == 1)
## # A tibble: 4 x 2
## flight late_10
## <int> <dbl>
## 1 2254 1
## 2 3656 1
## 3 3880 1
## 4 5854 1
# A flight is 30 minutes early 50% of the time, and 30 minutes late 50% of the time.
flights %>%
group_by(flight) %>%
summarise(early_30_min = sum(arr_delay <= -30, na.rm = TRUE) / n(),
late_30_min = sum(arr_delay >= 30, na.rm = TRUE) / n()) %>%
filter(early_30_min == 0.5,
late_30_min == 0.5)
## # A tibble: 3 x 3
## flight early_30_min late_30_min
## <int> <dbl> <dbl>
## 1 3651 0.5 0.5
## 2 3916 0.5 0.5
## 3 3951 0.5 0.5
# 99% of the time a flight is on time. 1% of the time it's 2 hours late.
flights %>%
group_by(flight) %>%
summarise(on_time = sum(arr_delay == 0, na.rm = TRUE) / n(),
late_2_hours = sum(arr_delay >= 120, na.rm = TRUE) / n()) %>%
filter(on_time == .99,
late_2_hours == .01)
## # A tibble: 0 x 3
## # ... with 3 variables: flight <int>, on_time <dbl>, late_2_hours <dbl>
Which is more important: arr_delay or dep_delay?
Which delay matters more depends on the traveller. If you hate waiting in the terminal for the flight to take off, then dep_delay is more important; if you care about reaching your destination on time, then arr_delay is.
Come up with another approach that will give you the same output as not_cancelled %>% count(dest) and not_cancelled %>% count(tailnum, wt = distance), without using count().
not_cancelled <- flights %>%
filter(!is.na(dep_delay), !is.na(arr_delay))
# original
not_cancelled %>%
count(dest)
## # A tibble: 104 x 2
## dest n
## <chr> <int>
## 1 ABQ 254
## 2 ACK 264
## 3 ALB 418
## 4 ANC 8
## 5 ATL 16837
## 6 AUS 2411
## 7 AVL 261
## 8 BDL 412
## 9 BGR 358
## 10 BHM 269
## # ... with 94 more rows
# new
not_cancelled %>%
group_by(dest) %>%
summarise(n = n())
## # A tibble: 104 x 2
## dest n
## <chr> <int>
## 1 ABQ 254
## 2 ACK 264
## 3 ALB 418
## 4 ANC 8
## 5 ATL 16837
## 6 AUS 2411
## 7 AVL 261
## 8 BDL 412
## 9 BGR 358
## 10 BHM 269
## # ... with 94 more rows
# original2
not_cancelled %>%
count(tailnum, wt = distance)
## # A tibble: 4,037 x 2
## tailnum n
## <chr> <dbl>
## 1 D942DN 3418
## 2 N0EGMQ 239143
## 3 N10156 109664
## 4 N102UW 25722
## 5 N103US 24619
## 6 N104UW 24616
## 7 N10575 139903
## 8 N105UW 23618
## 9 N107US 21677
## 10 N108UW 32070
## # ... with 4,027 more rows
# new2
not_cancelled %>%
group_by(tailnum) %>%
summarise(n = sum(distance, na.rm = TRUE))
## # A tibble: 4,037 x 2
## tailnum n
## <chr> <dbl>
## 1 D942DN 3418
## 2 N0EGMQ 239143
## 3 N10156 109664
## 4 N102UW 25722
## 5 N103US 24619
## 6 N104UW 24616
## 7 N10575 139903
## 8 N105UW 23618
## 9 N107US 21677
## 10 N108UW 32070
## # ... with 4,027 more rows
Our definition of cancelled flights (is.na(dep_delay) | is.na(arr_delay)) is slightly suboptimal. Why? Which is the most important column?
There are no flights which arrived but did not depart, so we can just use !is.na(dep_delay).
flights %>%
filter(is.na(dep_delay)) %>%
count(day)
## # A tibble: 31 x 2
## day n
## <int> <int>
## 1 1 246
## 2 2 250
## 3 3 109
## 4 4 82
## 5 5 226
## 6 6 296
## 7 7 318
## 8 8 921
## 9 9 593
## 10 10 535
## # ... with 21 more rows
flights %>%
group_by(day) %>%
summarise(prop_canceled = sum(is.na(dep_delay)) / n(),
avg_delay = mean(dep_delay, na.rm = TRUE)) %>%
ggplot(mapping = aes(x = avg_delay, y = prop_canceled)) +
geom_point() +
geom_smooth(se = FALSE)
## `geom_smooth()` using method = 'loess'
Which carrier has the worst delays? Challenge: can you disentangle the effects of bad airports vs. bad carriers? Why/why not? (Hint: think about flights %>% group_by(carrier, dest) %>% summarise(n()).)
# worst delays
flights %>%
group_by(carrier) %>%
summarize(mean_delay = mean(arr_delay, na.rm = TRUE)) %>%
arrange(desc(mean_delay))
## # A tibble: 16 x 2
## carrier mean_delay
## <chr> <dbl>
## 1 F9 21.9207048
## 2 FL 20.1159055
## 3 EV 15.7964311
## 4 YV 15.5569853
## 5 OO 11.9310345
## 6 MQ 10.7747334
## 7 WN 9.6491199
## 8 B6 9.4579733
## 9 9E 7.3796692
## 10 UA 3.5580111
## 11 US 2.1295951
## 12 VX 1.7644644
## 13 DL 1.6443409
## 14 AA 0.3642909
## 15 HA -6.9152047
## 16 AS -9.9308886
# challenge: bad airports vs. bad carriers
flights %>%
group_by(carrier, dest) %>%
summarize(mean_delay = mean(arr_delay, na.rm = TRUE)) %>%
group_by(carrier) %>%
summarize(mean_delay_mad = mad(mean_delay, na.rm = TRUE)) %>%
arrange(desc(mean_delay_mad))
## # A tibble: 16 x 2
## carrier mean_delay_mad
## <chr> <dbl>
## 1 VX 12.390156
## 2 OO 10.519400
## 3 YV 8.974067
## 4 9E 8.197407
## 5 EV 7.094112
## 6 DL 7.002298
## 7 UA 5.043940
## 8 US 5.034137
## 9 B6 4.995649
## 10 WN 4.506001
## 11 AA 3.311529
## 12 MQ 2.879322
## 13 FL 1.551060
## 14 AS 0.000000
## 15 F9 0.000000
## 16 HA 0.000000
What does the sort argument to count() do? When might you use it?
The sort argument sorts the results of count() in descending order of n. You might use it when you would otherwise follow the count with arrange(desc(n)); it saves a step.
# Grouped mutates and filters: find the worst members of each group
flights_sml %>%
group_by(year, month, day) %>%
filter(rank(desc(arr_delay)) < 10)
## # A tibble: 3,306 x 7
## # Groups: year, month, day [365]
## year month day dep_delay arr_delay distance air_time
## <int> <int> <int> <dbl> <dbl> <dbl> <dbl>
## 1 2013 1 1 853 851 184 41
## 2 2013 1 1 290 338 1134 213
## 3 2013 1 1 260 263 266 46
## 4 2013 1 1 157 174 213 60
## 5 2013 1 1 216 222 708 121
## 6 2013 1 1 255 250 589 115
## 7 2013 1 1 285 246 1085 146
## 8 2013 1 1 192 191 199 44
## 9 2013 1 1 379 456 1092 222
## 10 2013 1 2 224 207 550 94
## # ... with 3,296 more rows
# Find all groups bigger than a threshold
popular_dests <- flights %>%
group_by(dest) %>%
filter(n() > 365)
popular_dests
## # A tibble: 332,577 x 19
## # Groups: dest [77]
## year month day dep_time sched_dep_time dep_delay arr_time
## <int> <int> <int> <int> <int> <dbl> <int>
## 1 2013 1 1 517 515 2 830
## 2 2013 1 1 533 529 4 850
## 3 2013 1 1 542 540 2 923
## 4 2013 1 1 544 545 -1 1004
## 5 2013 1 1 554 600 -6 812
## 6 2013 1 1 554 558 -4 740
## 7 2013 1 1 555 600 -5 913
## 8 2013 1 1 557 600 -3 709
## 9 2013 1 1 557 600 -3 838
## 10 2013 1 1 558 600 -2 753
## # ... with 332,567 more rows, and 12 more variables: sched_arr_time <int>,
## # arr_delay <dbl>, carrier <chr>, flight <int>, tailnum <chr>,
## # origin <chr>, dest <chr>, air_time <dbl>, distance <dbl>, hour <dbl>,
## # minute <dbl>, time_hour <dttm>
popular_dests %>%
filter(arr_delay > 0) %>%
mutate(prop_delay = arr_delay / sum(arr_delay)) %>%
select(year:day, dest, arr_delay, prop_delay)
## # A tibble: 131,106 x 6
## # Groups: dest [77]
## year month day dest arr_delay prop_delay
## <int> <int> <int> <chr> <dbl> <dbl>
## 1 2013 1 1 IAH 11 1.106740e-04
## 2 2013 1 1 IAH 20 2.012255e-04
## 3 2013 1 1 MIA 33 2.350026e-04
## 4 2013 1 1 ORD 12 4.239594e-05
## 5 2013 1 1 FLL 19 9.377853e-05
## 6 2013 1 1 ORD 8 2.826396e-05
## 7 2013 1 1 LAX 7 3.444441e-05
## 8 2013 1 1 DFW 31 2.817951e-04
## 9 2013 1 1 ATL 12 3.996017e-05
## 10 2013 1 1 DTW 16 1.157257e-04
## # ... with 131,096 more rows
Which plane (tailnum) has the worst on-time record? Here "on-time" means arriving within 30 minutes of the scheduled arrival.
flights %>%
group_by(tailnum) %>%
summarize(prop_on_time = sum(arr_delay <= 30, na.rm = TRUE) / n(),
mean_arr_delay = mean(arr_delay, na.rm = TRUE),
flights = n()) %>%
arrange(prop_on_time, desc(mean_arr_delay))
## # A tibble: 4,044 x 4
## tailnum prop_on_time mean_arr_delay flights
## <chr> <dbl> <dbl> <int>
## 1 N844MH 0 320 1
## 2 N911DA 0 294 1
## 3 N922EV 0 276 1
## 4 N587NW 0 264 1
## 5 N851NW 0 219 1
## 6 N928DN 0 201 1
## 7 N7715E 0 188 1
## 8 N654UA 0 185 1
## 9 N427SW 0 157 1
## 10 N136DL 0 146 1
## # ... with 4,034 more rows
# What time of day should you fly to avoid delays as much as possible?
flights %>%
group_by(hour) %>%
summarize(arr_delay = sum(arr_delay > 5, na.rm = TRUE) / n()) %>%
ggplot(aes(x = hour, y = arr_delay)) +
geom_col()
Avoid flying in the evening to minimize your arrival delay.
Using lag(), how is the departure delay of a flight related to the delay of the immediately preceding flight?
flights %>%
group_by(origin) %>%
arrange(year, month, day, hour, minute) %>%
mutate(prev_dep_delay = lag(dep_delay)) %>%
ggplot(aes(x = prev_dep_delay, y = dep_delay)) +
geom_point() +
geom_smooth()
## `geom_smooth()` using method = 'gam'
## Warning: Removed 14383 rows containing non-finite values (stat_smooth).
## Warning: Removed 14383 rows containing missing values (geom_point).
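dplyr::lag() used above pairs each flight with the previous flight's departure delay by shifting the vector forward one position. A base-R sketch of that behaviour (lag1 is a hypothetical helper, not part of dplyr):

```r
# lag1() mimics dplyr::lag() with default n = 1: drop the last element
# and pad the front with NA, so position i holds the value from i - 1.
lag1 <- function(x) c(NA, x[-length(x)])

dep_delay <- c(2, -1, 15, 40)   # hypothetical departure delays
lag1(dep_delay)                 # NA  2 -1 15
```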
ggplot(diamonds, aes(x)) +
geom_histogram()
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
ggplot(diamonds, aes(y)) +
geom_histogram()
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
ggplot(diamonds, aes(z)) +
geom_histogram()
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
# default binwidth
ggplot(diamonds, aes(price)) +
geom_histogram()
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
# binwidth = 100
ggplot(diamonds, aes(price)) +
geom_histogram(binwidth = 100) +
scale_x_continuous(breaks = seq(0, 20000, by = 1000))
There are far fewer diamonds priced around $1,500 than at nearby price points. This gap is not apparent when using the default number of bins.
3. How many diamonds are 0.99 carat? How many are 1 carat? What do you think is the cause of the difference?
ggplot(diamonds, aes(carat)) +
geom_histogram(binwidth = .01) +
coord_cartesian(xlim = c(.97, 1.03))
There are many more 1.00 carat diamonds than 0.99 carat diamonds. It doesn’t make much sense to sell a 0.99 carat diamond when it can be cut as 1.00 carat and command a noticeably higher price.
# full plot
ggplot(diamonds, aes(carat, price)) +
geom_point() +
geom_smooth()
## `geom_smooth()` using method = 'gam'
# xlim
ggplot(diamonds, aes(carat, price)) +
geom_point() +
geom_smooth() +
xlim(1, 3)
## `geom_smooth()` using method = 'gam'
## Warning: Removed 34912 rows containing non-finite values (stat_smooth).
## Warning: Removed 34912 rows containing missing values (geom_point).
# coord_cartesian
ggplot(diamonds, aes(carat, price)) +
geom_point() +
geom_smooth() +
coord_cartesian(xlim = c(1, 3))
## `geom_smooth()` using method = 'gam'
With xlim() or ylim(), all observations outside the limits are removed before the plot is built, so they are not used to generate it at all. With coord_cartesian(), every observation is used to build the plot, and the result is merely cropped when zooming in. Note how the smoothing line changes in the xlim() example because it no longer has all the data points to calculate from.
The recommended approach is to replace unusual values with missing values. The easiest way to do this is with mutate() and ifelse():
diamonds2 <- diamonds %>%
mutate(y = ifelse(y < 3 | y > 20, NA, y))
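ifelse() tests its condition elementwise, returning NA where the condition holds and the original value otherwise. A base-R sketch using hypothetical y values (widths in mm, where anything outside 3–20 is implausible):

```r
# Hypothetical diamond widths; 0 and 58.9 are clearly data-entry errors.
y <- c(0, 4.3, 5.1, 58.9, 6.0)

# Replace implausible values with NA, keep everything else.
y2 <- ifelse(y < 3 | y > 20, NA, y)
y2   # NA 4.3 5.1 NA 6.0
```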
What does na.rm = TRUE do in mean() and sum()?
It strips missing values before computing the statistic. Without it, a single NA makes the whole result NA.
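A quick demonstration of the difference:

```r
x <- c(10, 20, NA)

mean(x)                # NA: one missing value poisons the result
mean(x, na.rm = TRUE)  # 15: the NA is dropped before averaging
sum(x, na.rm = TRUE)   # 30
```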
# original chart
flights %>%
mutate(
cancelled = is.na(dep_time),
sched_hour = sched_dep_time %/% 100,
sched_min = sched_dep_time %% 100,
sched_dep_time = sched_hour + sched_min / 60
) %>%
ggplot(mapping = aes(sched_dep_time)) +
geom_freqpoly(mapping = aes(colour = cancelled), binwidth = 1/4)
# revised chart
flights %>%
mutate(
cancelled = is.na(dep_time),
sched_hour = sched_dep_time %/% 100,
sched_min = sched_dep_time %% 100,
sched_dep_time = sched_hour + sched_min / 60
) %>%
ggplot(aes(x = sched_dep_time, y = ..density.., color = cancelled)) +
geom_freqpoly(binwidth = 1/4)
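The mutate() in both charts converts an HHMM integer such as 1730 into a continuous hour, using integer division (%/%) for the hour and the remainder (%%) for the minutes:

```r
# 5:30 pm is stored as the integer 1730 in sched_dep_time.
sched_dep_time <- 1730

sched_hour <- sched_dep_time %/% 100   # integer division -> 17
sched_min  <- sched_dep_time %%  100   # remainder        -> 30

sched_hour + sched_min / 60            # 17.5, a continuous hour-of-day
```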
ggplot(diamonds, aes(carat, price)) +
geom_point() +
geom_smooth()
## `geom_smooth()` using method = 'gam'
ggplot(diamonds, aes(cut, carat)) +
geom_boxplot()
Carat size is the most important predictor of price. On average, Fair and Good cut diamonds are larger than Premium and Ideal cut diamonds.
To create a horizontal layer in ggplot2 with coord_flip(), you have to supply aesthetics as if they were to be drawn vertically:
ggplot(diamonds, aes(cut, carat)) +
geom_boxplot() +
coord_flip()
# In ggstance, you supply aesthetics in their natural order
library(ggstance)
##
## Attaching package: 'ggstance'
## The following objects are masked from 'package:ggplot2':
##
## geom_errorbarh, GeomErrorbarh
ggplot(diamonds, aes(carat, cut)) +
geom_boxploth()
# devtools::install_github("hadley/lvplot")
library(lvplot)
# with boxplot
ggplot(diamonds, aes(cut, price)) +
geom_boxplot()
# with lvplot
ggplot(diamonds, aes(cut, price)) +
geom_lv()
# geom_violin
ggplot(diamonds, aes(cut, price)) +
geom_violin()
# faceted geom_histogram
ggplot(diamonds, aes(price)) +
geom_histogram() +
facet_grid(. ~ cut)
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
# colored geom_freqpoly
ggplot(diamonds, aes(price, color = cut)) +
geom_freqpoly()
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
To coerce a data frame into a tibble, use as_tibble():
as_tibble(iris)
## # A tibble: 150 x 5
## Sepal.Length Sepal.Width Petal.Length Petal.Width Species
## <dbl> <dbl> <dbl> <dbl> <fctr>
## 1 5.1 3.5 1.4 0.2 setosa
## 2 4.9 3.0 1.4 0.2 setosa
## 3 4.7 3.2 1.3 0.2 setosa
## 4 4.6 3.1 1.5 0.2 setosa
## 5 5.0 3.6 1.4 0.2 setosa
## 6 5.4 3.9 1.7 0.4 setosa
## 7 4.6 3.4 1.4 0.3 setosa
## 8 5.0 3.4 1.5 0.2 setosa
## 9 4.4 2.9 1.4 0.2 setosa
## 10 4.9 3.1 1.5 0.1 setosa
## # ... with 140 more rows
You can create a new tibble from individual vectors with tibble().
tibble(
x = 1:5,
y = 1,
z = x ^ 2 + y
)
## # A tibble: 5 x 3
## x y z
## <int> <dbl> <dbl>
## 1 1 1 2
## 2 2 1 5
## 3 3 1 10
## 4 4 1 17
## 5 5 1 26
Another way to create a tibble is with tribble().
tribble(
~x, ~y, ~z,
#--|--|----
"a", 2, 3.6,
"b", 1, 8.5
)
## # A tibble: 2 x 3
## x y z
## <chr> <dbl> <dbl>
## 1 a 2 3.6
## 2 b 1 8.5
The comment line #--|--|---- is useful to make it really clear where the header is.
Printing
Tibbles have a refined print method that shows only the first 10 rows, and all the columns that fit on screen. In addition to its name, each column reports its type, a nice feature borrowed from str():
tibble(
a = lubridate::now() + runif(1e3) * 86400,
b = lubridate::today() + runif(1e3) * 30,
c = 1:1e3,
d = runif(1e3),
e = sample(letters, 1e3, replace = TRUE)
)
## # A tibble: 1,000 x 5
## a b c d e
## <dttm> <date> <int> <dbl> <chr>
## 1 2017-09-04 04:13:46 2017-09-24 1 0.89127770 l
## 2 2017-09-03 20:36:31 2017-09-08 2 0.94658562 k
## 3 2017-09-03 17:52:42 2017-09-28 3 0.85563417 v
## 4 2017-09-04 08:21:48 2017-09-04 4 0.72182279 t
## 5 2017-09-03 21:23:24 2017-09-23 5 0.60327025 t
## 6 2017-09-03 17:34:15 2017-09-24 6 0.08274355 f
## 7 2017-09-03 17:55:49 2017-09-11 7 0.91840370 c
## 8 2017-09-04 09:50:24 2017-09-14 8 0.19719921 y
## 9 2017-09-04 12:53:39 2017-09-16 9 0.26916681 a
## 10 2017-09-04 06:46:19 2017-09-12 10 0.70053942 a
## # ... with 990 more rows
You can explicitly print() the data frame and control the number of rows (n) and the width of the display. width = Inf displays all columns:
nycflights13::flights %>%
print(n = 10, width = Inf)
If you want to pull out a single variable, you need some new tools, $ and [[. [[ can extract by name or position; $ only extracts by name but is a little less typing.
df <- tibble(
x = runif(5),
y = rnorm(5)
)
# Extract by name
df$x
## [1] 0.03238639 0.01334500 0.57868442 0.98030351 0.80002169
df[["x"]]
## [1] 0.03238639 0.01334500 0.57868442 0.98030351 0.80002169
# Extract by position
df[[1]]
## [1] 0.03238639 0.01334500 0.57868442 0.98030351 0.80002169
These also work in a pipe; you just need to use the special placeholder .:
df %>% .$x
## [1] 0.03238639 0.01334500 0.57868442 0.98030351 0.80002169
df %>% .[["x"]]
## [1] 0.03238639 0.01334500 0.57868442 0.98030351 0.80002169
Some older functions don’t work with tibbles; if you encounter one, use as.data.frame() to turn a tibble back into a data frame:
tb <- as_tibble(iris)
class(as.data.frame(tb))
## [1] "data.frame"
A data frame will print the entire contents. A tibble will only print (by default) the first 10 rows and as many columns as will fit in the console.
Compare and contrast the following operations on a data.frame and equivalent tibble. What is different? Why might the default data frame behaviours cause you frustration?
# on a data frame
df <- data.frame(abc = 1, xyz = "a")
df$x
## [1] a
## Levels: a
df[, "xyz"]
## [1] a
## Levels: a
df[, c("abc", "xyz")]
## abc xyz
## 1 1 a
# on a tibble
df <- tibble(abc = 1, xyz = "a")
df$x
## Warning: Unknown or uninitialised column: 'x'.
## NULL
df[, "xyz"]
## # A tibble: 1 x 1
## xyz
## <chr>
## 1 a
df[, c("abc", "xyz")]
## # A tibble: 1 x 2
## abc xyz
## <dbl> <chr>
## 1 1 a
With a data frame, $ uses partial matching, so df$x silently returns the xyz column; a tibble never matches partially and instead warns about the unknown column. Subsetting a tibble with [ will always return a tibble, whereas subsetting a data frame with [ can potentially return a vector.
If you have the name of a variable stored in an object, e.g. var <- "mpg", how can you extract the reference variable from a tibble? Use [[:
var <- "hwy"
mpg[[var]]
## [1] 29 29 31 30 26 26 27 26 25 28 27 25 25 25 25 24 25 23 20 15 20 17 17
## [24] 26 23 26 25 24 19 14 15 17 27 30 26 29 26 24 24 22 22 24 24 17 22 21
## [47] 23 23 19 18 17 17 19 19 12 17 15 17 17 12 17 16 18 15 16 12 17 17 16
## [70] 12 15 16 17 15 17 17 18 17 19 17 19 19 17 17 17 16 16 17 15 17 26 25
## [93] 26 24 21 22 23 22 20 33 32 32 29 32 34 36 36 29 26 27 30 31 26 26 28
## [116] 26 29 28 27 24 24 24 22 19 20 17 12 19 18 14 15 18 18 15 17 16 18 17
## [139] 19 19 17 29 27 31 32 27 26 26 25 25 17 17 20 18 26 26 27 28 25 25 24
## [162] 27 25 26 23 26 26 26 26 25 27 25 27 20 20 19 17 20 17 29 27 31 31 26
## [185] 26 28 27 29 31 31 26 26 27 30 33 35 37 35 15 18 20 20 22 17 19 18 20
## [208] 29 26 29 29 24 44 29 26 29 29 29 29 23 24 44 41 29 26 28 29 29 29 28
## [231] 29 26 26 26
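The [[ idiom works the same way on a plain data frame. A small base-R sketch with hypothetical data, also showing why $ cannot be used here:

```r
# A hypothetical two-column data frame.
d <- data.frame(cty = c(18, 21), hwy = c(29, 31))

var <- "hwy"
d[[var]]   # 29 31: [[ evaluates var and extracts that column

# d$var would NOT work: $ looks for a column literally named "var",
# which does not exist, so it returns NULL.
d$var
```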
annoying <- tibble(
`1` = 1:10,
`2` = `1` * 2 + rnorm(length(`1`))
)
# 1 Extracting the variable called `1`.
annoying %>% .$`1`
## [1] 1 2 3 4 5 6 7 8 9 10
# 2 Plotting a scatterplot of `1` vs `2`.
ggplot(annoying, aes(`1`, `2`)) +
geom_point()
# 3 Creating a new column called `3` which is `2` divided by `1`
(annoying <- mutate(annoying, `3` = `2` / `1`))
## # A tibble: 10 x 3
## `1` `2` `3`
## <int> <dbl> <dbl>
## 1 1 2.655775 2.655775
## 2 2 4.902608 2.451304
## 3 3 5.031278 1.677093
## 4 4 10.078599 2.519650
## 5 5 11.501359 2.300272
## 6 6 13.439278 2.239880
## 7 7 15.721971 2.245996
## 8 8 16.460872 2.057609
## 9 9 19.249798 2.138866
## 10 10 19.624944 1.962494
# 4 Renaming the columns to `one`, `two` and `three`
rename(annoying,
one = `1`,
two = `2`,
three = `3`)
## # A tibble: 10 x 3
## one two three
## <int> <dbl> <dbl>
## 1 1 2.655775 2.655775
## 2 2 4.902608 2.451304
## 3 3 5.031278 1.677093
## 4 4 10.078599 2.519650
## 5 5 11.501359 2.300272
## 6 6 13.439278 2.239880
## 7 7 15.721971 2.245996
## 8 8 16.460872 2.057609
## 9 9 19.249798 2.138866
## 10 10 19.624944 1.962494
What does tibble::enframe() do? When might you use it?
enframe() is a helper function that converts named atomic vectors or lists to two-column data frames (name and value). You might use it if you have data stored in a named vector and you want to add it to a data frame while preserving both the name attribute and the actual values.
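A base-R sketch of the shape enframe() produces, without loading tibble itself:

```r
# A named vector: the names carry information we don't want to lose.
x <- c(a = 1, b = 2, c = 3)

# enframe(x) returns a two-column (name, value) table; the equivalent
# structure built with base R:
data.frame(name = names(x), value = unname(x))
```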
What option controls how many additional column names are printed at the footer of a tibble? The tibble.max_extra_cols option:
getOption("tibble.max_extra_cols")